1. Objective


This  DoD-EPSCoR  research  proposed  to  develop,  design  and  evaluate  an  on-board  intelligent 
health  assessment  tool  for  Air  Force  vibration  monitoring  applications.  The  system  developed  is 
capable  of  promptly  detecting,  correctly  identifying,  properly  accommodating  and  reliably 
predicting  the  gradual  material  degradation  and  catastrophic  component  failures  of  Air  Force 
vibrating structures in adverse operating environments. In addition, the system is equipped with
the ability to reason about temporal cause-and-effect relationships from raw data for the purpose
of predictive maintenance. In a similar spirit, the technology developed has been spun off to the
monitoring and control of chemical processes, semiconductor equipment and environmental
health.

An adaptive variation of the wavelet packet transform is designed to extract the necessary
time-frequency signatures from analytically redundant sensor channels (Appendix A). To validate
the health of a given vibration sensor, a neural network based observer is used to estimate
critical sensor measurements from collected and collated neighboring sensor readings (Appendix B).
With the aid of statistically based feature selection criteria, many of the feature components
containing little discriminant information are discarded, resulting in a feature subset with a
reduced number of parameters and no loss of classification performance (Appendix C). The
extracted reduced-dimensional feature vector is then used as input to a pattern classifier. A
hybrid neural/fuzzy network with an on-line, real-time learning algorithm is developed to perform
intelligent decision making (Appendix D). To provide the functionality for predictive
maintenance, knowledge representation and extraction are made possible through fuzzy based
reasoning (Appendix E). Additionally, two robust control laws based upon sliding mode variable
structures and discrete-time Lyapunov stability theory are proposed to provide fault tolerance
with guaranteed global stability and performance (Appendix F). To do so, a multi-model nonlinear
system identification scheme based on Laguerre filters is suggested to capture the changing
dynamics during the evolution of failures (Appendix G). A hierarchical architecture that combines
a high degree of reconfigurability with long-term memory is then proposed as a fault tolerant
control algorithm for complex nonlinear systems (Appendix H). Dual Heuristic Programming (DHP) is
used to adapt to faults as they occur for the first time, in an effort to prevent the build-up of
a general failure, and also as a tuning device after switching to a known scenario. A dynamical
database, initialized with as much information about the plant as is available, oversees the DHP
controller. The decisions of which models to record, when to intervene and where to switch are
taken autonomously based on specifically designed quality indexes. The problem formulation
results in a multiobjective optimization problem in which a uniformly distributed, near-optimal
and near-complete Pareto front is sought (Appendix I) in the feedback loop of the design
procedure.


The resulting system, with all needed components, is intended to fulfill the time-critical,
on-board needs at different levels of structural integrity over a global working envelope. The
research dedicated to Air Force utilization not only focuses on the mathematical treatment of the
developed fault detection, identification and accommodation systems but, more importantly,
promotes an enabling tool appropriate for on-board health decision making and adaptive
control.

Using the technology developed, automatic recognition of frog vocalization has been developed as
a valuable tool for a variety of biological research and environmental monitoring applications
(Appendix J). The simulation results show the promise of deploying an array of continuous,
on-line environmental monitoring systems based upon non-intrusive analysis of animal calls. In a
similar spirit, we also propose a cost-effective and computationally efficient acoustic emission
detection system combined with artificial neural network technology to recognize four major flow
patterns in an air-water two-phase vertical column (Appendix K). This technology is considered
crucial in most process, petrochemical and pharmaceutical industries.

The research objective is to demonstrate the feasibility and applicability of the proposed
health monitoring procedures and reconfigurable control laws through mathematical analyses,
numerical simulations and experimental verifications in chosen Air Force applications.
Spin-off applications to DoD structures (e.g., aeropropulsion engines, on-orbit satellites
and reusable launch vehicles), industrial processes and environmental condition monitoring are
promising and being pursued. The research goal is complemented by the building of educational
infrastructure, as evidenced by course development, laboratory establishment, seminar
organization and student training.


2.  Status  of  Effort 

This final report covers the entire contract effort from August 1, 1997 to July 31, 2001
(including a one-year extension at no cost to the program). The progress made is primarily
targeted at establishing the fundamental bases through analytical and simulation studies. In
addition, we have moved forward to validate the algorithms developed on several experimental
testbeds. To consolidate the technical efforts dedicated to the theoretical developments
appropriate for vibration monitoring in Air Force structures (e.g., aircraft, rotorcraft and
RLVs), a comprehensive “modular” approach is taken. The tasks as outlined include wavelet feature
extraction, sensor validation, feature selection, fault detection, identification and
classification, knowledge representation, multi-model on-line control autonomy, nonlinear system
identification, adaptive critic fault tolerant control and multiobjective optimization design.
The approach evolves as more mathematical analyses are accomplished and desired specifications
are defined. The methodology proposed is generic and applicable to a wide variety of industrial
applications (e.g., chemical fluidized processes, semiconductor RIE facilities), medical
monitoring (e.g., EKG, intra-cranial pressure monitoring) and environmental assessment (e.g.,
frog/bird call monitoring). The vibration data set used for preliminary tests is derived from
U.S. Navy CH-46E Sea Knight helicopters and is commonly known as the Westland data set. The data
set (4-second time series) is available in the public domain and has been used for benchmark
comparison by researchers within this community. The data set is clean and complete, as opposed
to the USAF Academy bearing data set, which is highly noisy and sparse.

In addition, we have built a machinery fault simulator based on a SpectraQuest motor test unit.
This design allows us to simulate various real-time fault scenarios, including faulted bearings,
shaft misalignment and imbalance of rotating inertia. The proof-of-concept system has been
tested in much more complicated environments. The efforts dedicated to wavelet packet feature
extraction, sensor channel validation, statistical feature selection, neural-fuzzy fault
classification, sliding mode fault tolerant control, multi-model nonlinear system identification,
adaptive critic control and knowledge representation have been successful. The applications to
the Navy’s Westland data and the SpectraQuest machinery fault simulator show significant
improvements over the available results reported in the literature. In addition, the Principal
Investigator (PI) has continuously reached out to the non-DoD community to spin off the proven
technology. The extension of using human-like sensory channels (i.e., acoustic emission sensor,
CCD camera, omni-directional microphone and ultrasonic sensor) to monitor chemical,
pharmaceutical and petrochemical processes has been convincing. The extension to monitoring
environmental conditions through frog or bird calls is particularly innovative. We continuously
exchange information with POs at OKC-ALC, including the B-1B, B-52, E-3T and C/KC-135 units, to
transfer the technology originally developed for rotorcraft to fixed-wing aging aircraft. One
contract has been finalized to develop the technology needed for capturing failure mode
information during complicated maintenance actions.

The principal investigator would like to express his gratitude to the Air Force Office of
Scientific Research (AFOSR) and to program managers Major Brian Sanders and Dr. Daniel Segalman
for their generous support during the course of this study. Without their gracious financial and
technical support, this research would not have been feasible.


3.  Accomplishments 

3.1  Introduction 


Modern engineering technology is leading to increasingly complex Air Force vehicles with
ever more demanding performance criteria. Imminent needs in prolonging service life and
readiness for global defense challenges call for an even higher standard in structural
reliability. A downsized workforce, a declining development budget and the desire for a “better,
cheaper and smarter” resolution have further complicated the risk decisions. These problems are
even more crucial in today’s global defense industry. Condition monitoring has long been
recognized as a top priority in the development of the next generation of aircraft and reusable
launch vehicles by BMDO. However, currently used diagnostic systems that rely primarily on
ingenious sensor innovations or healthy redundant sensor placements to provide early warning and
maintenance procedures are costly, vulnerable, labor-intensive and computationally expensive to
validate.

At one extreme, the maintenance and sustainment of aging capital-intensive infrastructure
demand innovative technology in condition-based maintenance. The USAF Aging Aircraft &
Systems Office (ASC/AMA), located at Wright-Patterson AFB, has the direct responsibility
within the Air Force components. A substantial and growing portion of the military
transportation systems and infrastructure was built more than 40 years ago and is now
approaching or has exceeded its original design lifetime. An outstanding example of an aging
aircraft is the U.S. Air Force’s C/KC-135 airborne tanker fleet, which is approximately 40 years
of age and has recently been redesigned to remain in service at least through the year 2030.
Similarly, the MH-53J PAVE LOW helicopters at the U.S. Air Force Special Operations Command are
fast approaching their design lifetime and need to be on duty for another 40 years.

A 1997 study conducted by the National Research Council for the USAF addressed the
technological status, needs and opportunities facing the Air Force’s aging aircraft fleet. Of
particular concern is the C/KC-135 tanker. Replacement cost for the C/KC-135 fleet alone is
estimated at $40 billion, and this represents but one of the USAF aging systems. The Oklahoma
City Air Logistics Center at Tinker AFB (OKC-ALC) serves as the primary aging aircraft
maintenance depot for the USAF fleets of bombers (B-52, in service since 1961), tankers
(C/KC-135, in service since 1958) and surveillance aircraft (E-3 AWACS, in service since 1977).
The PI and collaborating researchers in the State of Oklahoma have established the Oklahoma
Center for Aging Systems and Infrastructure (OCASI) to provide basic research, technology
transfer and training to the OKC-ALC operations.

The  OCASI  is  dedicated  to  developing  technologies  and  analysis  tools  for  predicting, 
extending  and  controlling  the  lives  of  systems  and  components  of  infrastructure  such  as 
airframes,  roads  and  bridges,  oil  fields  and  refinery  machinery,  general  physical  plant  and 
electrical/electronic equipment. As a founding member of OCASI, the PI has actively contributed
to meeting the technology and education demands of the maintenance operations at OKC-ALC. The PI
has discussed with several POs at OKC-ALC the opportunity to apply the technology developed
within this program to fixed-wing aircraft. In addition, delegates from the USAF Aging Aircraft
and Systems Office, Boeing-Wichita, American Airlines and others serve on the executive panel of
OCASI.

In the most recent board meeting (June 21, 2001), the PI presented the research undertaken
herein, supported by the AFOSR/NA, to researchers from Hughes and Raytheon. Future
collaboration is under discussion. Significant interest has been received from Air Force end
users and industrial technology developers. Continuous contacts and interactions with Air Force
program offices and commercialization partners will be maintained to explore future research
opportunities. The potential for, and commitment to, follow-up spin-off research from DoD end
users and the defense industry are developing.

This final report documents the progress made throughout the entire contract period on
wavelet packet feature extraction, sensor channel validation, statistical feature selection,
neural-fuzzy classification, sliding mode fault tolerant control, multi-model nonlinear system
identification, adaptive critic dynamic programming, knowledge representation and
multiobjective optimization design.

As shown in Figure 1, a generic structural health monitoring approach is constantly evolving.
The system assumes that a given sensor suite will act as an on-line health usage monitor and, at
best, provide real-time control autonomy. The sensor suite can incorporate various types of
sensory devices, from vibration accelerometers, omni-directional microphones, computer vision
CCDs and pressure gauges to temperature indicators. The decision can be shown on a visual
on-board display (for the pilot or the ground maintenance crew) or fed to the control block to
invoke controller reconfiguration. The approach has been continuously refined as more
mathematical analyses are accomplished and desired specifications are defined.

3.2 Wavelet Packet Feature Extraction


Condition monitoring of dynamic systems based on vibration signatures has generally relied
upon Fourier based analysis as a means of translating vibration signals in the time domain into
the frequency domain. However, Fourier analysis provides a poor representation of signals that
are well localized in time. In this case, it is difficult to detect and identify the signal
pattern from the expansion coefficients because the information is diluted across the whole
basis. The Wavelet Packet Transform (WPT) is introduced as an alternative means of extracting
time-frequency information from vibration signatures. The resulting wavelet packet transform
coefficients provide one with arbitrary time-frequency resolution of a signal. With the aid of
statistically based feature selection criteria, many of the feature components containing little
discriminant information can be discarded, resulting in a feature subset with a reduced number
of parameters and no loss of classification performance. The extracted reduced-dimensional
feature vector is then used as input to a neural network classifier. This significantly reduces
the long training time that is often associated with neural network classifiers and improves
their generalization capability.

This research has investigated the feasibility of applying the wavelet packet transform to the
classification of vibration signals. Using the wavelet packet transform, a rich collection of
time-frequency characteristics in a signal can be obtained and examined for classification
purposes. In this study we detailed an innovative feature selection process that exploits signal
class differences in the wavelet packet node energy. This results in a feature space of reduced
dimension compared to the dimension of the original time series signal. The wavelet packet based
features obtained by our method for vibration signals yield nearly 100% correct classification
when used as input to a neural network classifier.
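
As an illustration of what a wavelet packet node-energy feature vector looks like, the following
minimal sketch decomposes a signal with a full Haar packet tree and returns the energy of each
terminal node. The Haar filter, the depth of 3, the test signal and the function names are
assumptions chosen for brevity; they are not the adaptive transform or the settings reported in
Appendix A.

```python
import math

def haar_step(signal):
    """Split a signal into Haar low-pass and high-pass halves (orthonormal)."""
    s = 1.0 / math.sqrt(2.0)
    low = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    high = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal) - 1, 2)]
    return low, high

def wavelet_packet_energies(signal, depth):
    """Return the energy of every node at the given depth of a full Haar packet tree.

    Unlike the plain wavelet transform, the packet transform splits both the
    low-pass and the high-pass branch at every level, giving 2**depth nodes.
    Nodes are returned in filter-bank (natural) order, not strict frequency order.
    """
    if len(signal) % (2 ** depth) != 0:
        raise ValueError("signal length must be divisible by 2**depth")
    nodes = [list(signal)]
    for _ in range(depth):
        next_nodes = []
        for node in nodes:
            low, high = haar_step(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return [sum(c * c for c in node) for node in nodes]

# Hypothetical test signal: a slow sinusoid with a short oscillating burst.
n = 64
signal = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
for t in range(48, 52):
    signal[t] += (-1) ** t

energies = wavelet_packet_energies(signal, depth=3)
total = sum(energies)
features = [e / total for e in energies]  # normalized node-energy feature vector
```

Because the Haar filters are orthonormal, the node energies sum to the energy of the original
signal, so the normalized vector describes how that energy is distributed over the
time-frequency tiling. A feature selection step would then keep only the nodes whose energies
differ most between signal classes.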

Please refer to Appendix A for a technical report published in the IEEE Transactions on
Industrial Electronics.

3.3  Sensor  Channel  Validation 

The validation of data from sensors has become an important part of the operation and
control of modern industrial equipment. To validate a signal, the sensor must be shown to
consistently provide the correct data, and the validation hardware or software should raise an
alarm when the sensor signal deviates from its nominal value. Neural network based models can be
used to estimate critical sensor values when neighboring sensor measurements are used as inputs.
The discrepancy between the measured and predicted sensor values may then be used as an
indication of sensor health.

A methodology for estimating sensor values and detecting sensor failure has been developed
in this work. The method allows us to estimate critical sensor data when other sensor
measurements are used as inputs. An auto-correlation analysis in the frequency domain was used
to detect sensor failure. The estimator network is a synergetic combination of fuzzy logic and
neural networks. It employs the fast parallel computation and learning capability of neural
networks. In addition, fuzzy set theory adds the ability to represent and manipulate imprecise
information. The Winner-Take-All (WTA) Experts Network consists of two main layers: a fuzzy
membership cluster layer and an MLP experts layer. The cluster layer employs the Gaussian radial
basis function as a fuzzy membership function. The general idea is to divide a complicated
problem into a series of sub-problems and assign a set of function approximators to each
sub-problem. A growing fuzzy membership clustering method was used to divide the input space
into overlapping regions on which 'experts' act. After the WTA Experts Networks were trained in
the sensor nominal state, the estimation result was compared with real sensor outputs in failure
modes. The difference between the two signals in the frequency domain was calculated. The
auto-correlation of the residual is analyzed to decide whether or not it is white noise.
Additionally, a jump indicator, a mean indicator and a variance indicator helped to identify
sensor failures in the time domain. The results from both the frequency and time domains were
combined to cover the eight failure modes known in the literature.
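
The residual whiteness check described above can be sketched as follows. This is a simplified
time-domain illustration, not the frequency-domain analysis of Appendix B: the residual is
formed directly from measured and predicted samples, and it is declared white when few of its
sample auto-correlations leave the usual 1.96/sqrt(N) confidence band. The function names, the
number of lags, the tolerated fraction and the synthetic drift fault are all assumptions made
for the example.

```python
import math
import random

def autocorrelation(residual, max_lag):
    """Sample auto-correlation of a residual sequence for lags 1..max_lag."""
    n = len(residual)
    mean = sum(residual) / n
    centered = [r - mean for r in residual]
    var = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / var
        for lag in range(1, max_lag + 1)
    ]

def looks_white(residual, max_lag=20, allowed_fraction=0.25):
    """Treat the residual as white if few lags fall outside +/-1.96/sqrt(N)."""
    bound = 1.96 / math.sqrt(len(residual))
    rho = autocorrelation(residual, max_lag)
    outside = sum(1 for r in rho if abs(r) > bound)
    return outside / max_lag <= allowed_fraction

# Synthetic example: the estimator output is "predicted"; a healthy sensor
# differs from it only by measurement noise, a faulty one also drifts.
random.seed(0)
n = 500
predicted = [math.sin(0.05 * t) for t in range(n)]
noise = [random.gauss(0.0, 0.1) for _ in range(n)]
healthy = [p + e for p, e in zip(predicted, noise)]
drifting = [m + 2.0 * t / n for t, m in enumerate(healthy)]

healthy_residual = [m - p for m, p in zip(healthy, predicted)]
drift_residual = [m - p for m, p in zip(drifting, predicted)]
sensor_ok = looks_white(healthy_residual)
drift_detected = not looks_white(drift_residual)
```

A slowly drifting sensor leaves a strongly correlated residual, which is why the test flags it,
whereas an isolated spike changes the auto-correlation very little. This is consistent with the
need for separate time-domain indicators for that failure mode.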

Two benchmark data sets, the SpectraQuest Machinery Fault Simulator data set and the
Westland vibration data set, were used in simulation experiments to demonstrate the performance
of the WTA Experts Networks. Comparisons between the WTA Experts Networks and two other neural
network estimators were made. The results show that, in terms of estimation performance (MSE),
the WTA network is competitive with or even better than the MLP networks and RBF networks alone.
Furthermore, the auto-correlation analysis based sensor validation algorithm was used to
investigate eight sensor failure modes in the frequency domain. The results from the simulation
studies have shown that the validation algorithm is effective in detecting seven of the eight
faults, the exception being the 'Spike' mode. With the help of the time-domain indicators, the
detection algorithm ultimately covered all eight faults.

Please  refer  to  Appendix  B  for  a  technical  report  published  in  the  ISA  Transactions. 

3.4  Statistical  Feature  Selection 

One advantage of using the wavelet packet transform to decompose a signal is that it allows us
to examine different time-frequency resolution components in a signal. However, direct
manipulation of the whole set of node energies is prohibitive because the space normally has
very high dimensionality, and the existence of undesired components makes the classification
unnecessarily difficult. In the training of a neural network classifier, it is desirable to use
a lower dimensional vector as input to the neural network to ease the design of the classifier
and improve its generalization capability. One popular technique for reducing the feature
dimensionality is the Karhunen-Loève (K-L) transform. The K-L transform is optimal for “signal
representation” in the sense that it provides the smallest mean square error for a given number
of retained features. However, the features defined by the K-L transform are not optimal for
“class separability”. One transformation aimed at class separability is based on the
within-class and between-class scatter matrices that are used in linear discriminant analysis in
statistics. The idea is to find a linear transformation that projects the samples onto a lower
dimensional space in which the samples within each class are as close together as possible and
the class mean vectors are dispersed as far as possible about the overall mean vector. In such a
case, two feature selection criteria based on measures of the overlap of the conditional
probability density functions among different classes were proposed to avoid possible numerical
problems.
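
To make the within-class and between-class scatter idea concrete, the sketch below scores each
feature separately by the ratio of between-class to within-class scatter and ranks the features
by that score. This per-feature Fisher-style ratio is an assumption chosen for illustration; it
is not one of the two probability-overlap criteria proposed in Appendix C. The toy data and the
function names are hypothetical.

```python
def fisher_ratio(feature_by_class):
    """Between-class over within-class scatter for one scalar feature.

    feature_by_class maps a class label to the list of values the feature
    takes on that class's training samples.
    """
    all_values = [v for values in feature_by_class.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    between = 0.0
    within = 0.0
    for values in feature_by_class.values():
        class_mean = sum(values) / len(values)
        between += len(values) * (class_mean - grand_mean) ** 2
        within += sum((v - class_mean) ** 2 for v in values)
    return between / within if within > 0 else float("inf")

def rank_features(samples, labels):
    """Return feature indices sorted from most to least discriminant."""
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        by_class = {}
        for x, y in zip(samples, labels):
            by_class.setdefault(y, []).append(x[j])
        scores.append(fisher_ratio(by_class))
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy node-energy vectors: feature 0 separates the classes, feature 1 does not.
samples = [[0.10, 0.50], [0.12, 0.40], [0.09, 0.60],
           [0.90, 0.55], [0.88, 0.45], [0.93, 0.50]]
labels = ["healthy", "healthy", "healthy", "faulty", "faulty", "faulty"]
order, scores = rank_features(samples, labels)
```

Scoring features one at a time avoids inverting a scatter matrix, at the price of ignoring
correlations between features. Keeping only the top-ranked entries of `order` gives the reduced
feature subset passed to the classifier.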

Please  refer  to  Appendix  C  for  a  technical  report  published  in  the  ISA  Transactions. 

3.5 Neural-Fuzzy Fault Classification


An innovative neuro-fuzzy system (interconnecting architecture and learning rule)
appropriate for fault detection, identification and classification in a machinery condition
health monitoring environment, called an “Incremental Learning Fuzzy Neural” (ILFN) network, is
proposed. The ILFN classifier, which uses localized neurons to represent the distributions of
the input space, employs a fast, real-time, one-pass, on-line and incremental learning
algorithm. The ILFN network employs a hybrid supervised and unsupervised learning scheme to
generate its prototypes. The network is a self-organized classifier with the ability to
adaptively learn new classes of failure symptoms without forgetting existing knowledge. The
classifier can detect new classes of failure modes and update its parameters continuously while
monitoring a system. To demonstrate the feasibility and effectiveness of the proposed
neuro-fuzzy paradigm, numerical simulations have been performed using the vibration data known
as the Westland data set, collected from a U.S. Navy CH-46E helicopter test stand. Using a
simple fast Fourier transform technique for feature extraction, the ILFN network has shown
promising results. With various torque levels used for training the network, 100% correct
classification was achieved for testing data at the same torque levels. In addition, the
classification performance of the network has been tested on some well-known benchmark data
sets, such as Fisher’s Iris data and the Deterding vowel data set. In terms of generalization
capability, comparison studies with other well-known classifiers were performed, and the ILFN
classifier was found to be competitive with or even superior to many existing classifiers.
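
The one-pass, incremental behavior described above can be illustrated with a much simpler
nearest-prototype learner. The sketch below is not the ILFN algorithm of Appendix D: it has no
fuzzy membership functions and no unsupervised stage, and the class name, the fixed `radius`
threshold and the toy data are assumptions. It only shows how prototypes can be added for new
failure classes without retraining or overwriting the existing ones.

```python
import math

class IncrementalPrototypeClassifier:
    """One-pass nearest-prototype learner (illustrative, not the ILFN itself)."""

    def __init__(self, radius):
        self.radius = radius      # hypothetical distance threshold
        self.prototypes = []      # each entry: [center, label, sample_count]

    def _nearest(self, x, label=None):
        """Nearest prototype to x, optionally restricted to one label."""
        best, best_dist = None, float("inf")
        for proto in self.prototypes:
            if label is not None and proto[1] != label:
                continue
            d = math.dist(x, proto[0])
            if d < best_dist:
                best, best_dist = proto, d
        return best, best_dist

    def learn(self, x, label):
        """Update a nearby prototype of the same label, or allocate a new one."""
        proto, d = self._nearest(x, label)
        if proto is not None and d <= self.radius:
            proto[2] += 1
            # Running mean keeps the center at the average of its samples.
            proto[0] = [c + (xi - c) / proto[2] for c, xi in zip(proto[0], x)]
        else:
            self.prototypes.append([list(x), label, 1])

    def classify(self, x):
        """Return the nearest label, or None when x is far from every prototype."""
        proto, d = self._nearest(x)
        if proto is None or d > self.radius:
            return None
        return proto[1]

clf = IncrementalPrototypeClassifier(radius=0.5)
stream = [([0.0, 0.0], "normal"), ([0.1, 0.0], "normal"),
          ([2.0, 2.0], "bearing fault"), ([2.1, 1.9], "bearing fault")]
for x, y in stream:
    clf.learn(x, y)

known = clf.classify([0.05, 0.02])          # close to the "normal" prototype
unknown = clf.classify([5.0, 5.0])          # far from everything seen so far
clf.learn([5.0, 5.0], "misalignment")       # a new class is added in one pass
after_update = clf.classify([5.1, 5.0])
still_known = clf.classify([0.05, 0.02])
```

Returning `None` for a pattern far from every prototype plays the role of new-class detection,
and learning the new class leaves the earlier prototypes untouched, which is the sense in which
existing knowledge is not forgotten.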

Please  refer  to  Appendix  D  for  a  technical  report  published  in  the  IEEE  Transactions  on 
Systems,  Man  and  Cybernetics,  Part  B:  Cybernetics. 

3.6  Fuzzy  Knowledge  Representation 


Among all well-known failure modes, an experienced expert can usually tell that an erratic
frequency response of a bearing sensor may indicate the degradation of the transmission gearbox.
An experienced operator in sulfuric acid treatment of phosphate rock may observe froth color or
bubble character to control the in-flow of process material. The capability of incorporating
this abstract expert knowledge into a learnable artificial neural network becomes essential in
solving the fault detection, identification and accommodation (FDIA) problem effectively.
Existing feedforward networks, which learn and generalize all nonlinear mappings from raw data,
do not provide such a mechanism.

Knowledge representation is the ability to translate knowledge entailed in a “black box”
neural network structure into linguistic and/or numeric rules that are accessible to human
system operators. The two main functions of an extracted rule set are to clearly explain
knowledge provided by domain experts and to provide a means for human operators to validate the
operation of the neural network. Rules provide reasoning and explanation capabilities for the
system. The classification output of the neural network can be justified in linguistic terms,
increasing human understanding of the logical process used by the network and identifying any
misclassifications that domain experts know to be incorrect. Classifications made by the neural
network can be verified against the knowledge base by comparing the results of presenting input
patterns to the network and to the rule base. Linguistic rules also help alleviate the
interference problem of artificial neural networks. When a neural network is trained using data
from the same system in different operating environments, a set of rules for each environment
may be extracted. This simplifies retraining and allows portability of the network and its rules
between systems and environments.

In this study, we propose a novel hybrid intelligent system (HIS), which provides a unified
integration of numerical and linguistic knowledge representations. The proposed HIS is a
hierarchical integration of an incremental learning fuzzy neural network (ILFN) and a linguistic
model, i.e., a fuzzy expert system (FES), optimized via a genetic algorithm (GA). The ILFN is a
self-organizing network with the capability of fast, one-pass, on-line and incremental learning.
The linguistic model is constructed based on knowledge embedded in the trained ILFN or provided
by the domain expert. The knowledge captured by the low-level ILFN can be mapped to the
higher-level linguistic model and vice versa. The GA is applied to optimize the linguistic model
to maintain high accuracy, comprehensibility, completeness, compactness and consistency. The
resulting HIS is capable of dealing with low-level numerical computation and higher-level
linguistic computation. After the system has been completely constructed, it can incrementally
learn new information in both numerical and linguistic forms. To evaluate the system’s
performance, the well-known benchmark Wisconsin breast cancer data set was studied as an
application to medical diagnosis. The simulation results have shown that the proposed HIS
performs better than the individual standalone systems. The comparison results show that the
linguistic rules extracted are competitive with or even superior to those of some well-known
methods.

Please  refer  to  Appendix  E  for  a  technical  report  submitted  to  the  IEEE  Transactions  on 
Systems,  Man  and  Cybernetics,  Part  B:  Cybernetics. 

3.7  Sliding  Mode  Fault  Tolerant  Control 

As dynamic systems become more complex, experience more rapidly changing environments and
encounter a greater variety of unexpected component failures, solving the control problems of
such systems is a grand challenge for control engineers. Traditional control design techniques
are not adequate to cope with these systems, which may suffer from unanticipated dynamic
failures. In this research work, we investigate the fault tolerant control problem and the
existing intelligent control techniques using artificial neural networks, and propose an
intelligent control strategy to handle the desired trajectory tracking problem for systems
suffering from catastrophic faults or incipient failures. The approach is to continuously
monitor the system performance and identify the system's current state by using a fault
detection method based upon our best knowledge of the nominal system and nominal controller.
Once a fault is detected, the proposed intelligent controller adjusts its control signal by
adding a robust term (i.e., by switching to a sliding mode controller) to confine the system
performance within a boundary layer. At the same time, an artificial neural network is
initialized and compensates for the unknown fault dynamics on-line. Once the on-line learning
process converges, the control input is tuned again by using the output of the identification
model and a new least upper bound for the remaining uncertainty to further reduce the tracking
error. The simulation results show a significant improvement in trajectory following
performance with the proposed intelligent sliding mode controller.

We investigated fault tolerant control problems and proposed an intelligent sliding mode
control strategy for a specific fault tolerant control problem. The strategy is based upon the
discrete-time sliding mode control technique and a neural network on-line approximator, and it is
capable of self-optimization, on-line adaptation, autonomous fault detection and controller reconfiguration.
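
The switching idea described above (add a robust term that confines the tracking error to a boundary layer once a fault appears) can be illustrated with a minimal sketch. It is not the controller of Appendix F: the scalar plant, the fault signal, the gain `k` and the boundary-layer width `phi` are all assumptions chosen for illustration, the simulation is a simple Euler integration rather than the discrete-time design, and the neural network compensator is omitted.

```python
import math


def sat(z):
    """Saturation: linear inside [-1, 1], clipped outside (boundary-layer switching)."""
    return max(-1.0, min(1.0, z))


def simulate(k=2.0, phi=0.05, dt=1e-3, t_end=5.0, t_fault=1.0):
    """Track x_ref(t) = sin(t) for the assumed plant x' = -x + u + fault(t).

    The nominal model (-x) is known; fault(t) is unknown but bounded by 0.9 < k.
    Returns a list of (time, tracking error) pairs.
    """
    x = 1.0
    history = []
    for i in range(int(round(t_end / dt))):
        t = i * dt
        x_ref, x_ref_dot = math.sin(t), math.cos(t)
        s = x - x_ref                               # sliding variable = tracking error
        # Nominal cancellation plus the robust sliding mode term.
        u = x + x_ref_dot - k * sat(s / phi)
        # Hypothetical fault dynamics, switched on at t_fault.
        fault = 0.9 * math.cos(2.0 * t) if t >= t_fault else 0.0
        x += dt * (-x + u + fault)                  # Euler step of the plant
        history.append((t + dt, x - math.sin(t + dt)))
    return history
```

Because the gain `k` exceeds the assumed fault bound, the error is driven into the layer `|s| < phi` and stays there after the fault occurs. In the approach described above, the learned fault model would then allow a smaller bound and a further reduction of the error.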

We first suggest using neural network techniques to improve the accuracy of the nominal
model and the nominal controller. Then, the abnormal system behavior due to system component
failures can be monitored and identified. Once system faults are detected, the control law is
reconfigured by using a sliding mode control technique together with a neural network
to recover the system performance. The neural network is used to learn the unknown failure
dynamics  on-line  and  to  estimate  the  least  upper  bound  of  the  remaining  uncertainty.  The 
resulting intelligent control law shows promising performance. The simulation results indicate a
significant performance improvement under the proposed intelligent control strategy, even
when the system becomes unstable due to multiple unexpected component failures. Further
research work will focus on developing an intelligent control methodology for more general
failure cases where both the nominal model and the unknown failure dynamics are general
nonlinear functions. When the nominal model is not in "affine in control" form, or when its
mathematical representation is not readily available, we may use neural networks to
replace it. Under this structure, the problems of learning the unanticipated failure mode dynamics
on-line and reconfiguring the control actions are much more difficult to solve in real
time.

Please  refer  to  Appendix  F  for  a  technical  report  to  be  published  in  the  International  Journal 
of  Control. 

3.8  Multi-Model  Nonlinear  System  Identification 

Recently, model-free or data-driven control has been attracting great interest as a way to overcome the
limitations of conventional model based control methodologies. However, existing data-driven
control is far from practical because of its slow convergence, severe computational burden
and lack of analysis and synthesis tools. This paper responds to these issues of
data-driven control methodologies with the multiple model approach.

First,  a  feasibility  check  of  the  data-driven  approach  is  done.  Since  we  do  not  make  any 
assumptions  about  the  system,  it  is  important  to  verify  that  the  unknown  system  can  be  identified 
only with sampled input-output data. After a discussion of the embedding theorem, the
literature is reviewed regarding nonlinear sampled data system identification, multiple modeling
and multiple model based control. The literature review reveals that the multiple model approach is
quite promising; however, it still lacks systematic tools for analysis and synthesis. A new
approach based on orthonormal bases is proposed to maintain tractability while keeping the
advantages of multiple models. A simulation study is included to verify the existing as well as the new
algorithms. Conclusions and suggestions for improvement follow.

This  paper  was  motivated  by  the  ambition  to  realize  a  practical  data-driven  control  system 
comparable  to  conventional  model  based  methods.  The  main  theme  was  to  adopt  a  multiple 
model approach instead of a global approach. By this adoption, we can relieve the computational
burden significantly while also maintaining mathematical tractability. Also, this approach
enables us to take advantage of existing methods such as linear system identification and
many estimation techniques. The proposed algorithm takes advantage of the nice property of
orthonormal basis functions. By this, we can relieve the difficulty of estimating the order of
regression vectors and also achieve efficient training while maintaining mathematical tractability.
The  simulation  results  in  section  4  verified  the  algorithm.  Three  other  models  were  also 
considered:  linear  ARX  models,  feedforward  neural  networks  based  ARX  models  and  SOM 
based  multiple  models.  The  simulation  was  done  with  Matlab  and  a  small  toolbox  was  written 
for  this  study.  The  proposed  model  needs  more  refinement  and  analysis.  One  issue  is  to  select  the 
proper  weights.  One  possible  idea  is  to  utilize  the  correlation  since  the  local  models  are  linear 
systems  and  we  assume  that  the  nonlinear  system  is  locally  linear.  Another  issue  is  locating  the 
optimal  poles.  As  mentioned  before,  there  are  many  advantages  if  we  can  make  the  Laguerre 
bases  orthonormal  to  each  other.  Also,  the  pole  estimation  method  based  on  the  iterative 
optimization  method  has  limitations  in  on-line  implementation. 
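
The orthonormality that the Laguerre bases rely on can be checked numerically. The sketch below is not the toolbox written for this study: it assumes the standard discrete Laguerre filter bank (a first-order low-pass stage followed by identical all-pass stages), and the function names, the pole value 0.5 and the number of bases are chosen only for illustration.

```python
import math


def laguerre_states(u, pole, n_basis):
    """Run a discrete Laguerre filter bank over the input sequence u.

    Stage 1 is sqrt(1 - a^2) / (z - a); each later stage applies the
    all-pass section (1 - a z) / (z - a) to the previous stage's output.
    Returns one list of n_basis filter outputs per time step.
    """
    gain = math.sqrt(1.0 - pole * pole)
    state = [0.0] * n_basis
    out = []
    for u_t in u:
        new = [0.0] * n_basis
        new[0] = pole * state[0] + gain * u_t
        for k in range(1, n_basis):
            # All-pass recursion: y(t+1) = a*y(t) + v(t) - a*v(t+1)
            new[k] = pole * state[k] + state[k - 1] - pole * new[k - 1]
        state = new
        out.append(state)
    return out


def gram_matrix(rows):
    """Inner products between the columns of a list of equal-length rows."""
    n = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]


# Impulse responses of the first three Laguerre filters (pole assumed to be 0.5).
impulse = [1.0] + [0.0] * 399
G = gram_matrix(laguerre_states(impulse, pole=0.5, n_basis=3))
```

For a stable pole the matrix `G` is numerically the identity, so the filter outputs can serve directly as regressors of a linear-in-parameters local model, and only the pole and the number of bases need to be chosen instead of the regression order.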

Please  refer  to  Appendix  G  for  a  technical  report  to  be  published  in  the  International  Journal 
of  Control. 

3.9  Adaptive  Critic  Dynamic  Programming 

As  complex  systems  suffer  from  faults,  the  original  model  parameters,  or  even  their  own 
dynamic  structure,  may  change  in  a  multitude  of  unpredictable  ways.  Even  if  the  system  has  a 
satisfactory  linearization  around  the  nominal  operating  point,  nonlinearities  may  become  of 
paramount importance after a fault occurs. Since complex systems pose a challenge even in the
design of models under nominal conditions, the task of devising nonlinear high order
models off-line for all known fault scenarios can be a daunting one. When the stochastic nature of faults
is taken into consideration, and knowledge of all fault scenarios becomes impossible to possess,
it becomes clear that the problem of interest to fault tolerant control (FTC) cannot be dealt with
without on-line nonlinear adaptive control strategies. In the proposed architecture, Dual Heuristic
Programming (DHP), an Adaptive Critic Design (ACD), was chosen as the reconfigurable
controller due to its known effectiveness in noisy, nonlinear environments while making
minimal assumptions regarding the nature of that environment.

To the best of our knowledge, the application of the DHP reconfigurable controller represents one
of the most effective ways to deal with the unexpected dynamics that a plant may assume after
the occurrence of a fault. However, as an FTC scheme by itself, the use of a reconfigurable
controller such as DHP presents two main limitations. The first one arises from the fact that
solutions to a set of expected fault scenarios are often available and may involve the application
of very specific control laws. A reconfigurable controller alone, however, does not provide any
mechanism through which knowledge available during design time can be incorporated. The
second  limitation  arises  from  the  known  tradeoff  between  adaptation  and  long-term  memory.  As 
the  reconfigurable  controller  provides  faster  convergence  to  a  wider  range  of  control  solutions,  it 
fails  to  retain  the  knowledge  of  the  control  laws  designed  for  previously  visited  scenarios. 

To overcome both limitations, a novel supervisor system oversees the DHP controller in the
architecture. The Identifier and Controller Dynamical Database, located inside the supervisor,
contains the knowledge available during design time, as well as solutions devised online for
unexpected  fault  scenarios.  The  decisions  of  when  to  intervene  by  switching  to  a  known  control 
solution  and  when  to  add  a  new  identifier  and  controller  pair  to  the  database  are  taken  by  the 
supervisor based on the current fault scenario. Such information is extracted by the scenario
recognition module, which makes use of specifically designed quality indexes capable not only
of performing Fault Detection and Identification, but also of producing indispensable information
on the evolution of a fault through time. The synergetic combination of the superior adaptation
capabilities of the DHP controller with the fault information and long-term memory provided by
the multiple model structure of the proposed supervisor generates an advanced FTC scheme
capable of dealing with a diversified collection of actuator and component faults.
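
The supervisor's decision logic can be sketched as a small database lookup. This is only an illustration under stated assumptions, not the design of Appendix H: the class names, the mean squared one-step prediction error used as the quality index, and the fixed threshold are all hypothetical, and the DHP controller that would adapt to unknown scenarios is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Sample = Tuple[float, float, float]  # (state x, input u, observed next state)


@dataclass
class Entry:
    """One identifier/controller pair stored in the (hypothetical) database."""
    name: str
    identifier: Callable[[float, float], float]  # predicts next state from (x, u)
    controller: Callable[[float], float]         # control law for this scenario


@dataclass
class Supervisor:
    threshold: float
    database: List[Entry] = field(default_factory=list)

    def quality_index(self, entry: Entry, window: List[Sample]) -> float:
        """Mean squared one-step prediction error over recent samples (assumed index)."""
        return sum((entry.identifier(x, u) - x_next) ** 2
                   for x, u, x_next in window) / len(window)

    def decide(self, window: List[Sample]) -> Tuple[str, Optional[Entry]]:
        """Switch to a stored solution if one explains the data, else keep adapting."""
        best = min(self.database,
                   key=lambda e: self.quality_index(e, window),
                   default=None)
        if best is not None and self.quality_index(best, window) <= self.threshold:
            return "switch", best
        return "adapt", None

    def store(self, entry: Entry) -> None:
        """Add a newly devised identifier/controller pair for later reuse."""
        self.database.append(entry)
```

A call to `decide` on data from a scenario already in the database returns `("switch", entry)`, while data from an unexpected fault returns `("adapt", None)`. Once the adaptive controller has converged, `store` gives the supervisor the long-term memory that the reconfigurable controller alone lacks.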

Please  refer  to  Appendix  H  for  a  technical  report  submitted  to  the  IEEE  Transactions  on 
Neural  Networks. 

3.10  Multiobjective  Optimization 

Since the 1980s, the application of Evolutionary Algorithms (EAs) in solving
Multiobjective Optimization Problems (MOPs) has been receiving growing interest from the
evolutionary computation community. To search for a family of “acceptable” solutions, a so-called
Pareto set, by using the population-based parallel searching ability of EAs, several
Multiobjective Evolutionary Algorithms (MOEAs) have been proposed. However, most of these
MOEAs  have  difficulty  in  dealing  with  the  trade-off  between  uniformly  distributing  the 
computational  resources  and  finding  the  near-complete  and  near-optimal  Pareto  set.  On  the  other 
hand,  according  to  the  No  Free  Lunch  theorems,  no  formal  assurance  of  an  algorithm’s  general 
effectiveness  exists  if  insufficient  knowledge  of  the  problem  characteristics  is  incorporated  into 
the  algorithm  domain. 

In this study, the PI and his student propose a new evolutionary approach to multiobjective
optimization problems, the Rank-Density based Genetic Algorithm (RDGA), which synergistically
integrates selected features from existing MOEAs in a unique way. A new ranking method, the
automatic accumulated ranking strategy, and a “forbidden region” concept are introduced,
complemented by a revised adaptive cell density evaluation scheme and a rank-density based fitness
assignment technique. In addition, four types of MOP features (discontinuous and
concave Pareto fronts, local optimality, high-dimensional decision space and high-dimensional
objective space) are exploited and the corresponding MOP test functions are designed. By
examining the selected performance indicators, RDGA is found to be statistically competitive
with four state-of-the-art MOEAs in terms of keeping the diversity of the individuals along the
trade-off surface, tending to extend the Pareto front to new areas and finding a well-approximated
Pareto optimal front.
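
The two quantities that a rank-density scheme works with, Pareto dominance and the crowding of the objective space, can be illustrated with a short sketch. It shows only the generic building blocks for a minimization problem; it does not reproduce RDGA's automatic accumulated ranking, its forbidden region or its adaptive cell sizing, and the function names and the fixed `cell_size` are assumptions.

```python
from collections import Counter


def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated(points):
    """Members of the population that no other member dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def cell_density(points, cell_size):
    """Number of individuals in each cell of a fixed grid over objective space."""
    return Counter(tuple(int(f // cell_size) for f in p) for p in points)
```

For the population `[(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]`, the non-dominated set is `[(1, 5), (2, 3), (4, 1)]`. A fitness assignment can then favor individuals with good rank that sit in sparsely populated cells, which is the trade-off between convergence and diversity discussed above.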

For the MOP test functions that only possess discontinuous or concave Pareto fronts, the
recently developed approaches (RDGA, NSGA-II, PAES and SPEA-II) do not have much trouble
in finding some points of the true Pareto front, and RDGA is found to show better performance
in keeping the diversity of the individuals along the current trade-off surface, extending the
Pareto front to new areas, and finding a well-approximated, non-dominated set. However,
without cautious selection of an initial population, an MOP with a feature of local optimality will
easily cause most MOEAs problems in finding a pure global Pareto front. A local Pareto front
created by constraints may produce less difficulty than one generated by objective functions
if the former does not contain pseudo-global Pareto fronts. In addition, two complicated
MOP test functions with high-dimensional decision space and objective space are examined by
RDGA and the selected MOEAs. The experimental results demonstrate that RDGA produces
results statistically competitive with other representative Pareto-based MOEAs in finding a
near-optimal, near-complete and uniformly distributed Pareto front. Furthermore, as the test functions
used in this study are still far from embodying a complete MOP test suite, a more profound study
in developing a general model of MOEA and designing a more representative test function set in
this field is absolutely necessary in future work.

Please refer to Appendix I for a technical report submitted to the IEEE Transactions on
Evolutionary Computation.

3.11  Frog  Call  Monitoring 

Recently there has been increasing interest and expenditure in environmental monitoring, both in
North America and around the world. It is becoming essential to predict and assess the
environmental  impact  of  human  activities  on  plants  and  animals.  The  populations  of  certain 
kinds  of  animals  like  birds  and  frogs  are  excellent  indicators  of  overall  environmental  health.  As 
many  of  the  animals  in  an  area  may  be  heard  but  not  seen,  it  is  convenient  to  rely  on  their  sounds 
as  a  means  of  identification.  In  many  places  manual  census  is  not  feasible,  if  not  completely 
impossible.  As  a  result,  automatic  recognition  of  animal  sounds  is  considered  a  valuable  tool  for 
biological  research  and  environmental  monitoring  applications. 

In  this  research  an  automatic  monitoring  system,  which  can  recognize  the  vocalizations  of 
four  popular  species  of  frogs  and  can  identify  different  individuals  within  the  species  of  interest, 
is  proposed.  For  the  desired  monitoring  system,  species  identification  is  performed  first  with  the 
proposed  filtering  and  grouping  algorithm.  Individual  identification,  which  can  estimate  frog 
population within the specific species, is performed in the second stage. Digital signal
pre-processing, feature extraction, dimensionality reduction, and neural network pattern classification
are performed step by step in this stage. Wavelet Packet feature extraction and two
different dimension reduction algorithms are synergistically integrated to produce final feature
vectors, which are to be fed into a neural network classifier. The simulation results show the
promising  future  of  deploying  an  array  of  continuous,  on-line  environmental  monitoring  systems 
based  upon  non-intrusive  analysis  of  animal  calls. 
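
The wavelet packet feature extraction step can be sketched as follows. This is a minimal illustration, not the algorithm of Appendix J: it assumes the Haar wavelet, a full (non-adaptive) packet tree, node energies as features, and a segment length divisible by 2 to the power `depth`. The function names are hypothetical.

```python
import math


def haar_split(x):
    """One Haar analysis step: return (approximation, detail) at half length."""
    r = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / r for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / r for i in range(0, len(x) - 1, 2)]
    return approx, detail


def wavelet_packet_energies(x, depth):
    """Energy of each leaf node of a full Haar wavelet packet tree.

    Unlike the plain wavelet transform, both the approximation and the
    detail branch are split again at every level, giving 2**depth nodes.
    """
    nodes = [list(x)]
    for _ in range(depth):
        next_nodes = []
        for node in nodes:
            next_nodes.extend(haar_split(node))
        nodes = next_nodes
    return [sum(c * c for c in node) for node in nodes]
```

The resulting energy vector has one entry per node and preserves the total signal energy, so a slowly varying call and a rapidly alternating one place their energy in different entries. A dimension reduction step would then keep only the entries that separate the classes before the vector is passed to the classifier.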

Please  refer  to  Appendix  J  for  a  technical  report  published  in  the  International  Journal  of 
Computational  Intelligence  and  Applications. 

3.12  Acoustic  Monitoring  for  Process  Flow 

Gas-liquid two-phase flows are widely used in the chemical industry. Accurate measurements
of flow parameters, such as flow regimes, are the key to operating efficiency. Due to the
interface complexity of a two-phase flow, it is very difficult to monitor and distinguish flow
regimes on-line and in real time. In this paper we propose a cost-effective and computation-efficient
acoustic emission (AE) detection system combined with artificial neural network technology to recognize
four major patterns in an air-water vertical two-phase flow column. Several crucial AE
parameters are explored and validated, and we found that the density of acoustic emission events
and ring-down counts are two excellent indicators for flow pattern recognition problems.
Instead of the traditional Fair map, a hit-count map is developed and a multi-layer perceptron
neural network is designed as a decision-maker to describe an approximate transmission stage of
a given two-phase flow system.
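
The two AE indicators named above can be computed from raw data in a few lines. The sketch assumes the usual definitions (ring-down count as the number of upward threshold crossings within one hit, event density as hits per unit time); the function names and the threshold value are illustrative and are not taken from Appendix K.

```python
def ring_down_count(signal, threshold):
    """Count upward crossings of the detection threshold within one AE hit."""
    count = 0
    for prev, cur in zip(signal, signal[1:]):
        if prev <= threshold < cur:
            count += 1
    return count


def event_density(hit_times, duration):
    """AE hits per unit time over an observation window of the given duration."""
    return len(hit_times) / duration
```

For example, `ring_down_count([0, 2, 0, 2, 0, 0.5, 0], 1.0)` returns 2, because only two excursions rise above the threshold. Pairs of such values, collected per time window, are the kind of input a hit-count map or a neural network classifier would use.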

Please  refer  to  Appendix  K  for  a  technical  report  to  appear  in  the  ISA  Transactions. 


4.  Personnel Support

4.1  Principal  Investigator 

Gary  YEN,  Associate  Professor 

4.2  Graduate  Students 


Liang-Wei HO, Ph.D. Fall 2000
sliding mode fault tolerant control

Nick LEE, Ph.D. Fall 2001
Laguerre-based multi-model system identification

Phayung MEESAD, Ph.D. Fall 2001 (supported by Thai Government)
knowledge representation and discovery

Haiming LU, Ph.D. candidate, to be completed in Spring 2002
multiobjective optimization

Pedro de LIMA, Ph.D. candidate, to be completed in Spring 2003
adaptive critic fault tolerant control

Kuo-Chung LIN, M.S. Fall 1998
Wavelet transform feature extraction

Wei FENG, M.S. Spring 2000
sensor data validation and fusion

Fengming YANG, M.S. Spring 2000
reinforcement learning control

Qiang FU, M.S. Fall 2000
environmental monitoring using frog vocalization

Please refer to Appendix L for the overview charts associated with each subtask outlined
above.

“On-line fault accommodation control for catastrophic system failures,” Ho L. and Yen
S., submitted to International Journal of Control and Intelligent Systems.
Lee S. and Yen G.G., submitted to IEEE Transactions on Fuzzy Systems.

•  “Combined  numerical  and  linguistic  knowledge  representations  for  medical  diagnosis,” 
Meesad  P.  and  Yen  G.G.,  submitted  to  IEEE  Transactions  on  Systems,  Man  and  Cybernetics, 
Part  B:  Cybernetics. 

•  “Hierarchical  rank-density  genetic  algorithm  for  radial  basis  function  neural  network 
design,”  Yen  G.G.  and  Lu  H.,  submitted  to  International  Journal  of  Computational 
Intelligence  and  Applications. 

•  “Quantitative  measure  on  the  accuracy,  comprehensibility,  and  completeness  of  a  fuzzy 
expert  system,”  Meesad  P.  and  Yen  G.G.,  submitted  to  IEEE  Transactions  on  Systems,  Man 
and  Cybernetics,  Part  B:  Cybernetics. 

•  “Dynamic population size in multiobjective evolutionary algorithm,” Lu H. and Yen G.G.,
submitted to IEEE Transactions on Evolutionary Computation.

submitted to International Journal of Computational Intelligence and Applications.

5.2  Conference Proceedings

•  “Health  monitoring  on  vibration  signatures-  industrial  applications,”  Yen  G.G.,  30th  IEEE 

Meesad P., SPIE International Conference on Component and Systems Diagnostics,
Prognosis,  and  Health  Management,  April  16-17,  2001,  Orlando,  Florida. 

•  “On-line intelligent fault tolerant control for catastrophic system failures,” Yen G.G. and Ho
L., SPIE International Conference on Component and Systems Diagnostics, Prognosis, and
Health Management, April 16-17, 2001, Orlando, Florida.

•  “Coordination of exploration and exploitation in a dynamic environment,” Yen G.G., Yang
F., Hickey T. and Goldstein M., 2000 INNS/IEEE International Joint Conference on Neural
Networks, July 14-19, 2001, Washington, District of Columbia.

•  “A  SOM  mapping  technique  for  visualizing  documents  in  a  database,”  Morris  S.,  Wu  Z.  and 
Yen  G.G.,  2000  INNS/IEEE  International  Joint  Conference  on  Neural  Networks,  July  14-19, 
2001,  Washington,  District  of  Columbia. 

•  “On-line  multiple-model  based  fault  diagnosis  and  accommodation,”  Yen  G.G.  and  Ho  L., 
IEEE International Symposium on Intelligent Control, September 5-7, 2001, Mexico City, Mexico.

•  “Multiobjective optimization design via genetic algorithm,” Lu H. and Yen G.G., IEEE
Conference on Control Applications, September 5-7, 2001, Mexico City, Mexico.

•  “Reconfigurable  control  system  design  for  fault  tolerance,”  Ho  L.  and  Yen  G.G.,  40th  IEEE 
Conference on Decision and Control, December 4-7, 2001, Orlando, Florida.


6.  Interactions/Transitions 

6.1  Oklahoma City Air Logistics Center, Tinker AFB, OK

The PI is in the process of identifying, for OKC-ALC, the problems that may benefit from the
technology developed within this program. The POs for the B-1B, B-52, C/KC-135
and E-3 are on the list.

6.2  IMC-Agrico, Mulberry, FL

IMC-Agrico, based in Mulberry, FL, is interested in sensory-based process
monitoring, mainly on petroleum products. The research objective is to survey existing best
practices in sensory-based process monitoring and control.

In daily practice, the experienced operator in sulfuric acid treatment of phosphate rock may
observe froth color or bubble character to control process material in flow. The acoustic sound of
cavitation or boiling/flashing may prompt an increase or decrease of material flow rates. Smart
in-situ sensors, including proprietary low-power laser telemeters, surface photo voltage maps,
power spectra of vibration signatures, ultrasonic/ultraviolet waveguides, machine vision,
infrared telemetry and artificial noses/tongues, have provided potential mechanisms for factory
automation with promising industry applicability. Based on the findings within the DEPSCoR
program, we are developing a process-specific health monitoring and control system. An array of
error sensing microphones and a continuous stream video camera will be employed to pick up
the acoustic sound or video image to invoke the intelligent decision making of a two-phase
air-water flow distillation column.

6.3  Oklahoma Environmental Institute, Stillwater, OK

Automated environmental monitoring based solely upon frog vocalizations is considered a
valuable tool for a variety of biological research and environmental monitoring applications. We
propose to develop an automated, unattended, environmentally hardened monitoring system, which
can recognize the vocalizations of frog species of interest in the State of Oklahoma. The
proposed monitoring system will deploy an omni-directional microphone array to record the
frog calls in the field continuously, process the analog voice pick-ups (including background
noise cancellation, signal conditioning, time-frequency feature extraction, data compression, and
spectrogram signature classification) and then transmit only essential information over Mesonet
for follow-up environmental decision making. The successful development of the proposed frog
call monitoring system will provide a robust measurement to quantify the environmental noise
pollution. This proposal has been well received by the Oklahoma City Zoo and North American
Amphibian Society as a way to monitor the amphibian population as an indicator of environmental and
water quality.

7.  Patent  Disclosures 

None 

8.  Honors/Awards 

Promoted  to  Associate  Professor  in 

2000  Oklahoma  State  University  Halliburton  Outstanding  Young  Faculty  Award 

Best Presentation at 2000 American Control Conference, Chicago, IL, June 2000

For Vibration Monitoring
IEEE Transactions on Industrial Electronics, 47(3), 2000, pp. 650-667

Abstract: Condition monitoring of dynamic systems based on
vibration signatures has generally relied upon Fourier-based analysis
as a means of translating vibration signals in the time domain
into the frequency domain. However, Fourier analysis provides a
poor representation of signals well localized in time. In this case, it
is difficult to detect and identify the signal pattern from the expansion
coefficients because the information is diluted across the whole
basis. The wavelet packet transform (WPT) is introduced as an alternative
means of extracting time-frequency information from vibration
signatures. The resulting WPT coefficients provide one with
arbitrary time-frequency resolution of a signal. With the aid of statistical-based
feature selection criteria, many of the feature components
containing little discriminant information could be discarded,
resulting in a feature subset having a reduced number of
parameters without compromising the classification performance.
The extracted reduced dimensional feature vector is then used as
input to a neural network classifier. This significantly reduces the
long training time that is often associated with the neural network
classifier and improves its generalization capability.

Index Terms: Condition monitoring, diagnosis, fault detection,
wavelet transform.


I.  Introduction 

Any major piece of industrial machinery equipment requires
a certain degree of maintenance to assure successful
operation over a long period of time. To achieve this objective,
an automated condition monitoring system is needed. This
health usage monitoring (HUM) system would allow early detection
of potentially catastrophic faults that would be extremely
expensive to repair. It also allows for implementation of condition
based maintenance, and significant savings can be made by
delaying scheduled maintenance until convenient or necessary.
Generally, a simple condition monitoring system is approached
from a pattern classification perspective. It can be decomposed
into three general tasks: data acquisition, feature extraction, and
condition classification [1]. The most common family of monitoring
methods is based upon nondestructive vibration measurements
using multiple accelerometers [2]-[8]. The general
principle behind using vibration signals for monitoring involves
those components in mechanical systems that vibrate during
operation. When faults develop, some of the system dynamics
vary, resulting in significant deviations in the vibration patterns.
By employing appropriate data analysis algorithms, it is feasible
to detect changes in vibration signatures caused by faulty components
and to make decisions about the status of the machinery.
In many of the classification systems currently used (i.e., neural-network-based
systems, in particular), the process of feature extraction
is inherently embedded in the classification technique
rather than being identified as a separate process. If a multilayer
neural network is used to classify unprocessed data, the input
layer, which learns from examples, will essentially serve as a
feature extractor. However, in problems such as vibration time-series
data, the input dimensionality of the problem becomes an
impediment to classification. Even neural networks are limited
by the problem of parameter estimation: as the number of parameters
increases, the amount of data required to train the neural
network must increase to achieve satisfactory performance. For
a complex problem, obtaining the necessary data may be expensive
or even impossible. Feature extraction is needed to reduce
the dimensionality of the data before performing classification.
This is based upon the assumption that the important structure in
the data actually lies in a much lower dimensional space. Feature
extraction involves preliminary processing of sensor measurements
to obtain suitable parameters that reveal whether an interesting
pattern is emerging. It is generally not possible to classify
machine conditions based upon an individual sample of the vibration.
Therefore, a feature extraction technique is needed for
preliminary processing of recorded time-series vibrations over
a long period of time to obtain suitable parameters that, in linear
and/or nonlinear combination, reveal whether a fault is evolving.
In general, this requires windowing of the time-series vibration
signals to form signal segments on which linear, bilinear, or nonlinear
transformations are applied. The aim of feature extraction
is to devise a transformation that extracts the signal features
hidden in the original time domain. Corresponding to different
characteristics of signals, transformations should be properly selected
such that specific signal structure can be enhanced in its
transformation domain. This would make the following decision-maker
design (i.e., for fault classification) much easier.
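
The windowing step described above can be sketched in a few lines. This is an illustration only: the window length, the hop size and the use of the root-mean-square value as a per-segment parameter are assumptions, standing in for whichever transformation is actually selected.

```python
def sliding_windows(signal, length, hop):
    """Split a time series into (possibly overlapping) segments of fixed length."""
    return [signal[i:i + length]
            for i in range(0, len(signal) - length + 1, hop)]


def rms(segment):
    """Root-mean-square value of one segment, a simple scalar feature."""
    return (sum(v * v for v in segment) / len(segment)) ** 0.5


def window_features(signal, length, hop):
    """One feature per segment; a real system would apply a richer transform."""
    return [rms(seg) for seg in sliding_windows(signal, length, hop)]
```

With `length=4` and `hop=2`, a ten-sample record yields four overlapping segments. Each segment is then reduced to a small number of parameters, so the classifier sees a short feature vector per window rather than the raw samples.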

Usually, the vibration signals of defective components are
highly structured and can be grouped into two categories: sustained
defects and intermittent defects [9]. For sustained defects,
the signal is sinusoidal. Fourier-based analysis, which uses sinusoidal
functions as basis functions, provides an ideal candidate
for extraction of these narrow-band signals. For intermittent defects,
features reflecting machinery faults in the pickup (windowed)
time-series vibration signals neither appear in a repetitive
manner nor consist of regular frequency components with
the evolution of time. Instead, these signals often demonstrate
a nonstationary and transient nature, and carry small yet informative
components embedded in larger repetitive signals. In
this  case,  the  Short  Time  Fourier  Transform  (STFTVcan  he  em¬ 
ployed  to  detect  the  localized  transient.  Unfortunately,  the  fixed 
windowing  used  in  the  STFT  implies  fixed  time-frequency  res¬ 
olution  in  the  time-frequency  plane  [10],  [11].  The"*  difficulty  is 
that  the  accuracy  of  extracting  frequency  information  is  limited 
by  the  length  of  the  window  relative  to  the  duration  of  the  in¬ 
teresting  signal.  For  example,  in  helicopter  transmissions,  im¬ 
portant  information  concerning  hearings  can  be  on  the  order 
•  of  tens  of  hundreds  of  hertz,  whereas  mesh  frequencies  and 
important  fundamentals  associated  with  gearing  of  the  engine 
input,  can  be  on  the  order  of  tens  of  thousands  of  hertz.  To 
overcome  the  fixed  time-frequency  resolution  problems,  the  re¬ 
cently  developed  wavelet  based  analysis  [10],  which  provides 
flexible  time-frequency  resolution,  becomes  an  efficient  alter¬ 
native  in  dealing  with  this  type  of  machinery  transient  signals. 
Nonetheless,  linear  expansions  in  a  single  basis,  whether  Fourier 
or  wavelet,  are  not  flexible  enough.  The  Fourier-basis  analysis 
provides  a  poor  representation  of  signals  localized  in  time;  while 
wavelet  bases  are  not  well  adapted  to  represent  signals  whose 
Fourier transforms have narrow “high” frequency support because
of poor resolution at high frequency. In both cases, it is
difficult  to  detect  and  identify  the  signal  pattern  from  the  ex¬ 
pansion  coefficients  because  information  is  diluted  across  the 
whole  basis.  The  wavelet  packet  transform  (WPT)  [12],  on  the 
other  hand,  uses  a  rich  library  of  redundant  bases  with  arbitrary 
time-frequency  resolution.  Therefore,  it  enables  the  extraction 
of  features  from  signals  that  combine  nonstationary  and  sta¬ 
tionary  characteristics. 

The  collection  of  all  wavelet  packet  coefficients  contains  far 
too  many  elements  to  efficiently  represent  a  signal.  Care  must  be 
taken  in  choosing  a  subset  of  this  collection  in  order  to  manage 
the computational complexity in practical situations. For classification
applications, a natural direction is to address the issue
of  finding  a  wavelet-packet-based  feature  set  that  offers  max¬ 
imum  feature  separability  due  to  class-specific  characteristics. 
Our study explores the feasibility of the WPT as a tool in the
search  for  features  that  may  be  used  in  the  detection  and  classifi¬ 
cation  of  mechanical  vibration  signals.  In  particular,  we  formu¬ 
late  a  systematic  method  of  determining  wavelet-packet-based 
features  that  exploit  class-specific  differences  among  interesting 
signals.  This  allows  us  to  avoid  human  interaction.  One  could 
simply input a sample data set that represents the signals of interest
and receive as output the dominant features that are suitable
for classification purposes. In this study, we introduce a
novel methodology for classifying vibration signals based on
wavelet packet analysis. We suggest that such analysis can provide
a more effective method of achieving robust classification
over  the  more  traditional  single  resolution  techniques. 

The  study  investigates  the  use  of  the  wavelet-packet-based 
features in the classification of vibration signals. In Section II,
we discuss the inefficiency of Fourier-based analysis for transient
signal analysis and lead the reader to the wavelet-based
analysis: the wavelet transform and its generalization, the WPT.
Section III presents an overview of the proposed classification
system based on wavelet packet features. We first describe
the feature measure, which will be used throughout. Then,
we present two feature selection methodologies that aim to
reduce the input dimension for the classifier. In Section IV,
the feasibility of the proposed wavelet-packet-based feature
extraction technique is demonstrated through numerical simulations
of seed faults in the Westland transmission data set. We
present our results and discuss the performance with respect
to the parameters considered in our investigation. Finally, we
conclude our study in Section V.

II. Time-Frequency Analysis of Vibration Signals
A.  Fourier-Based  Analysis 

Vibration  signal  classification  generally  requires  windowing 
of  the  time-series  vibration  signals  to  form  signal  segments  on 
which  linear,  bilinear,  or  nonlinear  transformations  are  applied. 
The Fourier-based methods, in particular, the short-time Fourier
transform (STFT), are usually employed for the extraction of
narrow-band frequency content in signals. The difficulty with
STFT is that the accuracy for extracting frequency information
is limited by the length of this window relative to the duration
of the signal. Specifically, the STFT of $x(t)$ is defined as


$$G(f,\tau) = \int x(t)\, g^{*}(t-\tau)\, e^{-j 2\pi f t}\, dt \qquad (2.1)$$


where $g(t)$ is a window function. The STFT decomposes a
signal in the time domain into a two-dimensional function in
a time-frequency plane $(f,\tau)$. At a given frequency $f$, (2.1)
is equivalent to filtering a signal at all times with a bandpass
filter having as an impulse response the window function
modulated to that frequency $f$. Alternatively, given a segment
of signal windowed around time instant $\tau$, one computes all
frequencies of the STFT. Now, consider the ability of the
STFT to discriminate between two pure sinusoids. Given a
window function $g(t)$ and its Fourier transform $G(f)$, define
the bandwidth $\Delta f$ of the filter as


$$\Delta f^{2} = \frac{\int f^{2}\,|G(f)|^{2}\,df}{\int |G(f)|^{2}\,df} \qquad (2.2)$$

Then, two sinusoids will be discriminated only if they are more
than $\Delta f$ apart. Similarly, the spread in time is given by $\Delta t$,
defined as


$$\Delta t^{2} = \frac{\int t^{2}\,|g(t)|^{2}\,dt}{\int |g(t)|^{2}\,dt} \qquad (2.3)$$


So, two pulses in time can be discriminated only if they are more
than $\Delta t$ apart. Thus, the resolution in frequency of the STFT
analysis is given by $\Delta f$, and the resolution in time is given
by $\Delta t$. One important property, according to the uncertainty
principle [13], is that for any suitably chosen window function,
the time-bandwidth product of the window function has a lower
bound given by


$$\Delta t\,\Delta f = c \ge \frac{1}{4\pi}. \qquad (2.4)$$


Here, $c$ is a constant dependent on the choice of $g(t)$. Note that
once the window function $g(t)$ is defined, the area (time-bandwidth
product) of the window function in the time-frequency
plane remains fixed. This means we cannot increase the time and
frequency resolutions simultaneously. If we choose a window
function with small $\Delta t$ (good time resolution), then the corresponding
frequency resolution will be poor ($\Delta f$ will be large).

Fig. 1. Time-frequency representation of a signal using different lengths of analysis windows.

Consider  analyzing  a  signal  with  different  analysis  window 
functions  to  demonstrate  the  possible  deficiencies  of  fixed 
time-frequency  resolution  associated  with  the  STFT.  In  Fig.  1, 
the  signal  x(t),  as  shown  in  Fig.  1(a),  contains  a  short-duration 
high-frequency  component  and  a  long-duration  low-frequency 
component. Fig. 1(b)-(e) represents the different time-frequency
representations corresponding to the specific analysis
windows shown in the top row of Fig. 1. Fig. 1(b) corresponds
to the time-domain representation. In Fig. 1(c), the
frequency-domain  representation,  the  burst  is  diluted  across  the 
broad  frequency  range,  while  the  long-duration  low-frequency 
component  is  effectively  extracted.  As  we  decrease  the  length 
of  analysis  window  as  illustrated  in  Fig.  1(d)  and  (e),  the 
short-duration  burst  is  more  efficiently  extracted.  However,  the 
long-duration  low-frequency  component  becomes  more  am¬ 
biguous.  Consequently,  if  we  are  analyzing  the  low-frequency 
content  of  a  signal,  we  might  desire  a  wide  window  function  in 
time. On the contrary, if we were interested in high-frequency
phenomena, a short-duration window function would be preferred.
The STFT does not allow this flexibility, but, as we will
see  in  the  next  section,  wavelets  give  a  framework  for  which 
this  is  automatic. 
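
The lower bound in (2.4) can be checked numerically. The following is a minimal sketch, assuming a Gaussian window (the choice of window, the grid size `n`, and the span `half_span` are illustrative assumptions, not values from this study). It evaluates the spreads in (2.2) and (2.3) on a sampled grid and shows that the product stays at the bound $1/(4\pi) \approx 0.0796$ regardless of the window width: narrowing the window in time widens it in frequency by the same factor.

```python
import numpy as np

def time_bandwidth_product(sigma, n=4096, half_span=10.0):
    """Return (spread_t, spread_f) for a Gaussian window of width sigma.

    n and half_span are illustrative discretisation choices (assumptions).
    """
    dt = 2.0 * half_span / n
    t = (np.arange(n) - n // 2) * dt           # time axis centred on zero
    g = np.exp(-t**2 / (2.0 * sigma**2))       # Gaussian window g(t)

    # Spread in time, eq. (2.3): second moment of |g(t)|^2.
    pg = np.abs(g) ** 2
    spread_t = np.sqrt(np.sum(t**2 * pg) / np.sum(pg))

    # Spread in frequency, eq. (2.2): second moment of |G(f)|^2.
    f = np.fft.fftfreq(n, d=dt)                # frequencies matching np.fft.fft
    pG = np.abs(np.fft.fft(g)) ** 2
    spread_f = np.sqrt(np.sum(f**2 * pG) / np.sum(pG))
    return spread_t, spread_f

for sigma in (0.25, 0.5, 1.0):
    st, sf = time_bandwidth_product(sigma)
    print(f"sigma={sigma}: dt={st:.4f}, df={sf:.4f}, product={st * sf:.4f}")
```

The Gaussian is used here only because it attains the bound; other windows give a larger constant $c$.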


B.  Wavelet  Analysis 


Fourier-based analysis is based on sinusoidal functions of
various frequencies, whereas wavelet analysis is founded on
basis functions formed by dilation and translation of a prototype
function $\psi(t)$, also known as a mother wavelet [14]. The wavelet
basis function $\psi_{a,\tau}(t)$ is a family of short-duration
high-frequency and long-duration low-frequency functions defined as

$$\psi_{a,\tau}(t) = \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-\tau}{a}\right), \quad a > 0,\ \tau \in R. \qquad (2.5)$$

The parameter $\tau$ indicates the translation in time, and the parameter
$a$ is the scale parameter. From the scaling property of
Fourier transforms, if

$$\psi(t) \leftrightarrow \Psi(\omega) \qquad (2.6)$$

forms a Fourier transform pair, then

$$\frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t}{a}\right) \leftrightarrow \sqrt{a}\,\Psi(a\omega) \qquad (2.7)$$

where $a > 0$ is a continuous variable. Thus, a contraction in
one domain is accompanied by an expansion in the other, but
in a nonuniform way over the time-frequency plane. Depending
on the dilation parameter $a$, the wavelet function dilates or contracts
in time, causing the corresponding contraction or dilation
in the frequency domain. When $a$ is large ($a > 1$), the basis
function becomes a stretched version of the mother wavelet
($a = 1$) and demonstrates a low-frequency characteristic. When
$a$ is small ($a < 1$), this basis function is a contracted version of
the mother wavelet function and demonstrates a high-frequency
characteristic.

Similar to the STFT, one can analyze a signal with the continuous
wavelet transform (CWT), which decomposes a signal in the
time domain into a two-dimensional function in the time-scale
plane $(a,\tau)$

$$W(a,\tau) = \int x(t)\,\psi_{a,\tau}(t)\,dt. \qquad (2.8)$$

The wavelet coefficient $W(a,\tau)$ measures the time-frequency
content in a signal indexed by the scale parameters and translation
parameters. The term frequency instead of scale has been
used in order to aid in understanding, since a wavelet with a large
scale parameter is related to low-frequency content,
and vice versa. Thus, we see that the CWT corrects the noted
deficiencies of the Fourier analysis as described in the previous
section. That is, the CWT analyzes the low-frequency content of
a signal with a wide duration function and, conversely, analyzes
high-frequency phenomena with a short-duration function.
In practice, calculating wavelet coefficients at every possible
scale using (2.8) is a fair amount of work and generates a lot
of redundant data. It turns out that, if we limit the choice of $a$
and $\tau$ in (2.5) to discrete numbers, then our analysis will be
sufficiently accurate. In particular, if we choose scale and translation
parameters (i.e., $j$ and $k$, respectively) based on powers of
two, then there exists $\psi(t)$ with good time-frequency localization
property such that the set of functions

$$\psi_{j,k}(t) = 2^{j/2}\,\psi(2^{j}t - k), \quad j, k \in Z \qquad (2.9)$$

constitutes an orthonormal basis for $L^{2}(\mathbb{R})$ [15]. Here, $Z$ denotes
the set of integers, and $L^{2}(\mathbb{R})$ denotes the class of measurable
functions $x(t)$ in $\mathbb{R}$ satisfying

$$\int |x(t)|^{2}\,dt < \infty. \qquad (2.10)$$

Any signal $x(t)$ in $L^{2}(\mathbb{R})$ can then be expressed as

$$x(t) = \sum_{j,k} \langle x, \psi_{j,k}\rangle\,\psi_{j,k}(t). \qquad (2.11)$$

This is called the discrete wavelet transform (DWT). In practice,
the implementation of the DWT suitable for finite-length
discrete-time signals is based upon the multiresolution analysis
introduced by Mallat [16]. The development leads to a computationally
efficient algorithm known as the fast wavelet transform
(FWT). By introducing a new function, the scaling function
$\phi(t)$, the orthogonal wavelets could be constructed and incorporated
into a system that uses cascaded filters to decompose
a signal. Specifically, the wavelet $\psi(t)$ is often generated from
$\phi(t)$ through the following dilation equations:

$$\phi(t/2) = \sqrt{2}\sum_{k} h(k)\,\phi(t - k) \qquad (2.12)$$

$$\psi(t/2) = \sqrt{2}\sum_{k} g(k)\,\phi(t - k). \qquad (2.13)$$

Daubechies [15] has developed a procedure to solve the dilation
equations such that the sequences $h(k)$ and $g(k)$ have only
finitely many nonzero coefficients, which leads to a very efficient
algorithm for computing wavelet coefficients. In general, $h(k)$ is
the coefficient of the low-pass filter, whereas $g(k)$ represents
a high-pass filter. This practical filtering algorithm is, in fact,
a classical scheme known as two-channel subband coding
using quadrature mirror filters (QMFs) [17]. A consequence of
multiresolution  is  that  we  can  transform  a  signal  into  wavelets 
without  using  wavelets  or  scaling  functions.  In  general,  these 
functions  do  not  exist  as  explicit  functions;  they  are  limits  of 
iterations.  To  compute  the  wavelet  transform  all  we  need  are 
filters.  Rather  than  taking  the  scalar  product  of  the  scaling  func¬ 
tion  or  the  wavelet  with  the  signal,  we  convolve  the  signal  with 


these filters. Fig. 2 demonstrates the wavelet decomposition procedure
and shows the time-frequency plane corresponding to a
wavelet decomposition. In contrast with the STFT, the time resolution
becomes arbitrarily fine at high frequency, while the frequency
resolution becomes arbitrarily fine at low frequencies.
Note that, in FWT, the number of points is gradually decreased
through successive decimation. Thus, if we start with a signal of
$2^{J}$ points, then, in the following level, we have $2^{J-1}$ wavelet
coefficients. Therefore, the maximum decomposition level is equal
to $J$.
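
The filter-bank view of the FWT can be illustrated with a short sketch. The Haar filters used below are an assumption made for brevity (the text above does not prescribe a particular wavelet), and the function names `haar_step` and `haar_fwt` are hypothetical. Each level filters with the low-pass and high-pass pair and decimates by two, so a signal of $2^{J}$ points yields detail sequences of length $2^{J-1}, 2^{J-2}, \ldots, 1$ and at most $J$ levels.

```python
import numpy as np

def haar_step(x):
    """One filter-bank level: Haar filtering followed by decimation by 2."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass branch, h(k)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass branch, g(k)
    return approx, detail

def haar_fwt(x):
    """Cascade haar_step on the low-pass output until one sample remains."""
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])  # 2**3 points
approx, details = haar_fwt(signal)
print([d.size for d in details])   # [4, 2, 1]: three levels for 2**3 points
```

Only the low-pass output is split again at each level, which is what distinguishes this cascade from the wavelet packet decomposition of the next subsection.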

C. WPT

Whereas  the  wavelet  transform  provides  one  with  more 
flexible  time-frequency  resolution  properties  as  described,  one 
possible  drawback  is  that  the  frequency  resolution  is  rather 
poor  in  the  high-frequency  region.  Therefore,  it  faces  some 
difficulties  for  discrimination  between  signals  having  close 
high-frequency  components.  Wavelet  packets,  a  generalization 
of wavelet bases, are alternative bases that are formed by taking
linear combinations of the usual wavelet functions [12], [18].
These bases inherit properties such as orthonormality and
time-frequency localization from their corresponding wavelet
functions. A wavelet packet function $W_{j,k}^{n}(t)$ is a function with
three indices. As with usual wavelets, integers $j$ and $k$
index the scale and translation operations, respectively:

$$W_{j,k}^{n}(t) = 2^{j/2}\,W^{n}(2^{j}t - k). \qquad (2.14)$$

The  index  n  is  called  the  modulation  parameter  or  the  oscilla¬ 
tion  parameter.  The  first  two  wavelet  packet  functions  are  the 
usual  scaling  function  and  mother  wavelet  function,  respectively 

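$$W_{0,0}^{0}(t) = \phi(t) \qquad (2.15)$$

$$W_{0,0}^{1}(t) = \psi(t). \qquad (2.16)$$

Wavelet packet functions for $n = 2, 3, \ldots$ are then defined by
the following recursive relationships:

$$W_{0,0}^{2n}(t) = \sqrt{2}\sum_{k} h(k)\,W_{0,0}^{n}(2t - k) \qquad (2.17)$$

$$W_{0,0}^{2n+1}(t) = \sqrt{2}\sum_{k} g(k)\,W_{0,0}^{n}(2t - k) \qquad (2.18)$$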

where $h(k)$ and $g(k)$ are the QMFs associated with the predefined
scaling function and mother wavelet function. To measure
specific time-frequency information in a signal, we simply take
the inner product of the signal and that particular basis function.
The wavelet packet coefficients of a function $f$ can be computed
via

$$w_{j,n,k} = \langle f, W_{j,k}^{n}\rangle = \int f(t)\,W_{j,k}^{n}(t)\,dt. \qquad (2.19)$$

The idea of the usual wavelet decomposition, as shown in
Fig. 2, is generalized to describe the calculation of wavelet
packet coefficients $w_{j,n,k}$ of a discrete-time signal. Computing
the full wavelet packet decomposition (WPD) of a discrete-time
signal involves applying both filters to the discrete-time signal
$[x_1, x_2, \ldots, x_n]$ and then recursively to each intermediate
signal. The procedure is illustrated in Fig. 3.

Fig. 2. Wavelet decomposition of time-domain signal.

Fig. 3. Implementation of discrete WPD.

Note  that  the  method  of  decomposition  described  above  does 
not  result  in  a  WPT  tree  displayed  in  increasing  frequency  order. 
This  is  because  aliasing  occurs,  which  exchanges  the  frequency 
ordering of some nodes of the tree. A simple swapping of the appropriate
nodes results in the increasing frequency ordering referred
to as the Paley ordering [18] of the tree, as shown in Fig. 4.
Here,  the  dashed  lines  highlight  the  difference  with  Fig.  3.  In 
this  way,  the  leftmost  node  at  each  level  will  correspond  to  the 
lowest  frequency  band.  In  following  sections,  we  will  use  this 
representation  for  easier  interpretation. 

Whereas  the  FWT  decomposes  only  the  low-frequency  com¬ 
ponents,  WPT  decomposes  the  signal  utilizing  both  the  low-fre¬ 
quency  components  and  the  high-frequency  components.  This 
flexibility  of  a  rich  collection  of  abundant  information  with  ar¬ 
bitrary  time-frequency  resolution  allows  extraction  of  features 
that  combine  nonstationary  and  stationary  characteristics. 
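
The full decomposition and the node reordering described above can be sketched as follows. This is an illustration under stated assumptions: Haar filters stand in for whichever QMF pair is actually used, the helper names are hypothetical, and the reordering is written as the Gray-code permutation, which is one common way to realize the node swapping that yields increasing frequency order.

```python
import numpy as np

def haar_split(x):
    """Apply the low-pass and high-pass Haar filters, then decimate by 2."""
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high

def wpd_level(x, level):
    """Return the 2**level nodes at `level` in natural (filter-bank) order."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(level):
        children = []
        for node in nodes:
            children.extend(haar_split(node))   # both branches are split again
        nodes = children
    return nodes

def paley_order(nodes):
    """Reorder nodes by increasing frequency (Gray-code permutation)."""
    return [nodes[p ^ (p >> 1)] for p in range(len(nodes))]

# An alternating sequence oscillates at the highest representable frequency.
x = np.array([1.0, -1.0] * 4)
natural = wpd_level(x, 2)
paley = paley_order(natural)
print([round(float(np.sum(n**2)), 6) for n in natural])
print([round(float(np.sum(n**2)), 6) for n in paley])
```

For this test signal the energy sits in the third node of the natural ordering but in the last (highest-frequency) node after reordering, which is the behaviour the Paley-ordered tree is meant to provide.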

III. Feature Selection

A.  Overview 

The WPT is applied in classification problems based on time-series
vibration signatures. First, the vibration data is decomposed
via the WPT to extract the time-frequency-dependent information.
Features are then defined based upon the WPD coefficients.
Second, simple statistical processing based on discriminant
analysis is applied to identify a set of robust features that
provides the most discrimination among the classes of vibration
data. Then, a neural network classifier is trained based on
this reduced feature set. With statistical-based feature selection
criteria, several feature components containing little discriminant
information can be discarded, resulting in a feature subset
having a reduced number of parameters. This will significantly
ease the design of the neural classifier and enhance the generalization
capability of the system. In the following sections, we
define the WPD-based feature measurement used in this study.
Then, we discuss some feature selection criteria and present the
ones applied in this study to reduce the number of feature variables.

Fig. 4. The WPD tree displayed in Paley order.

B.  Feature  Measure 

One deficiency that wavelet bases inherently possess is the
lack of a translation-invariant property. To illustrate this by example,
consider two signals with a slight shift in time, as shown
in the left-hand side of Fig. 5. When the two signals are decomposed
via the WPT, we can see appreciable differences between
the  two  representations  of  the  signals  as  shown  in  the  right-hand 
side  of  Fig.  5  (a  darker  color  corresponds  to  a  larger  WPD  co¬ 
efficient  value).  Therefore,  direct  assessment  from  all  wavelet 
packet  coefficients  often  turns  out  to  be  tedious  or  leads  to  in¬ 
accurate  decisions. 

Recall that each wavelet packet coefficient is given by

$$w_{j,n,k} = \langle f, W_{j,k}^{n}\rangle = \langle f,\ 2^{j/2}\,W^{n}(2^{j}t - k)\rangle \qquad (3.1)$$

where $j$ is a scaling parameter, $k$ is a translation parameter, and
$n$ is an oscillation parameter. Each $w_{j,n,k}$ coefficient measures
a specific subband frequency content, controlled by the scale
parameter $j$ and the oscillation parameter $n$, of a signal around
time instant $2^{-j}k$.

We define the wavelet packet node energy as

$$e_{j,n} = \sum_{k} w_{j,n,k}^{2} \qquad (3.2)$$

which measures the signal energy contained in some specific
frequency band indexed by parameters $j$ and $n$. In the following,
we will call each $(j, n)$ a wavelet packet node. Fig. 6 displays
the energy distribution that is calculated based on all coefficients
in each wavelet packet node of the two signals given in Fig. 5.
We can see that the node energy values at levels two, three, or four
show no clear difference between the two signals. This example
reveals that the node energy representation provides us with a
more robust signal feature for classification than using coefficients
directly. In our strategy, each wavelet packet node energy
value was defined as an individual feature component and was
used as a robust rudimentary exploration of the specific signal
features that provide useful information for classification purposes.

Fig. 5. WPD of time-shifted signals. (a) $s_1$. (b) WPT of $s_1$. (c) $s_2$. (d) WPT of $s_2$.

Fig. 6. Wavelet packet node energy of time-shifted signals. (a) Energy map of $s_1$. (b) Energy map of $s_2$.
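
The node energy feature in (3.2) can be computed directly from the decomposition. The sketch below is illustrative only: it assumes Haar filters, a synthetic burst signal, and hypothetical function names, none of which come from this study. It collects the energies of all nodes at levels $1$ through $r$ and compares a signal with a time-shifted copy.

```python
import numpy as np

def haar_split(x):
    """Low-pass and high-pass Haar filtering followed by decimation by 2."""
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return low, high

def node_energy_features(x, levels):
    """Node energies e_{j,n} for every node at levels 1..levels."""
    nodes = [np.asarray(x, dtype=float)]
    features = []
    for _ in range(levels):
        nodes = [child for node in nodes for child in haar_split(node)]
        features.extend(float(np.sum(node**2)) for node in nodes)
    return np.array(features)

rng = np.random.default_rng(0)
burst = np.zeros(64)
burst[20:28] = rng.standard_normal(8)     # short synthetic transient
shifted = np.roll(burst, 3)               # the same transient, 3 samples later

e1 = node_energy_features(burst, 3)
e2 = node_energy_features(shifted, 3)
print(e1.size)                            # 14 = 2 + 4 + 8 node energies
print(np.round(np.abs(e1 - e2), 3))       # per-node change caused by the shift
```

Because the Haar filters are orthonormal, the energies at each level sum to the signal energy, so a shift can only redistribute energy among nodes. The node energies are not exactly shift invariant, which is consistent with the text describing them as more robust than the raw coefficients rather than invariant.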


C. Feature Dimension Reduction

One advantage of using WPT to decompose a signal is that it
allows us to examine different time-frequency resolution components
in a signal. For example, by computing the full WPD
on a signal segment with $n = 2^{J}$ points for $r$ resolution levels
(where $J$ and $r$ are positive integers), the result is a group of
$2^{1} + 2^{2} + \cdots + 2^{r} = 2^{r+1} - 2$ sets of coefficients where each
set corresponds to a wavelet packet node. If the node energy as
described before is used as a feature, we can obtain $2^{r+1} - 2$ feature
components. However, direct manipulation of a whole set
of node energies is prohibitive because the space normally has
very high dimensionality, and the existence of undesired components
makes the classification unnecessarily difficult. In the
training of a neural network classifier, it is desirable to use a
lower dimensional vector as input to the neural network to ease
the design of the classifier and improve its generalization capability.

Fig. 7. Example of feature extraction for classification.

One popular technique in reducing the feature dimensionality
is the Karhunen-Loève (K-L) transform [19]. The K-L transform
is optimal for “signal representation” in the sense that
it provides the smallest mean-square error for a given number
of features. However, the features defined by the K-L transform
are not optimal for “class separability.” As an example, the
data from two-class categories with a Gaussian distribution are
shown in Fig. 7. In the sense of the K-L transform, the principal axis
1 with a larger eigenvalue is a better vector than axis 2 to represent
the vectors of this distribution. That is, the selection of axis
1 produces a smaller mean-square error of representation than
the selection of axis 2 alone. However, as seen in Fig. 7, if the
two distributions are mapped onto axis 1, the marginal density
functions are heavily overlapped. On the other hand, if they are
mapped onto axis 2, the marginal densities are well separated.
Therefore, for classification purposes, axis 2 is a better feature
than axis 1 alone, preserving more classification information.

As described previously, it is not the mean-square error, as in
the sense of the K-L transform, but the classification accuracy that
should be considered a primary criterion for reducing the feature
dimension. The ability to classify patterns relies on the implied
assumption that different classes occupy distinct regions
in the pattern space. Intuitively, the more distant the classes are
from each other, the better the chance of successful recognition
of class membership of patterns. One transformation associated
with this assumption is based on within- and between-class
scatter matrices that are used in linear discriminant analysis
(LDA) of statistics [20]. The idea is to find a linear transformation
that projects the samples onto a lower dimensional space in
which the samples within each class are as close together as
possible, and the class mean vectors are dispersed as widely as
possible about the overall mean vector.

Specifically, consider an $L$-class problem. The class sample
covariance matrices measure the variability of samples within
each class

$$S_c = \frac{1}{N_c}\sum_{i=1}^{N_c} (x_i^{c} - m_c)(x_i^{c} - m_c)^{T}, \quad c = 1, 2, \ldots, L \qquad (3.3)$$

where $x_i^{c}$ is a sample vector belonging to class $c$, $N_c$ is the
number of samples belonging to class $c$, and $m_c$ is the mean
vector of class $c$

$$m_c = \frac{1}{N_c}\sum_{i=1}^{N_c} x_i^{c}. \qquad (3.4)$$

As a result, the overall within-class variability can be estimated
by the sample covariance matrix

$$S_w = \sum_{c=1}^{L} p_c S_c \qquad (3.5)$$

where $p_c$ is the a priori probability of class $c$. Similarly, the
between-class covariance matrix measures the dispersion of the
class mean vectors about the overall mean vector

$$S_b = \sum_{c=1}^{L} p_c\,(m_c - m)(m_c - m)^{T} \qquad (3.6)$$

where $m$ represents the expected vector of the mixture distribution
and is given by

$$m = \sum_{c=1}^{L} p_c\,m_c. \qquad (3.7)$$


Now, if $\tilde{x} = A^{T}x$ denotes a linear transformation of the original
variables, then the between- and within-class covariance matrices
in the transformed space can be found as $\tilde{S}_b = A^{T}S_bA$
and $\tilde{S}_w = A^{T}S_wA$. The goal is to find a subspace where the
ratio of $\tilde{S}_b$ to $\tilde{S}_w$ is maximized. In this case, it may be measured
by the ratio of the determinants of the preceding matrices
(i.e., the determinant, being the product of the eigenvalues, is the
product of the variances in the principal directions). The problem
could, thus, be formulated so as to find a transformation $A$ such
that

$$A = \arg\max_{A} \frac{|A^{T}S_bA|}{|A^{T}S_wA|}. \qquad (3.8)$$


The solution for (3.8) is given by the $\min(n, L-1)$ eigenvectors
of $S_w^{-1}S_b$, where $n$ is the dimension of the original data set
[20]. Once the transformation map $A$ is obtained, the feature
vector $A^{T}x$ is computed for each sample and, finally, it is
assigned to the class which has the mean vector closest to this
feature vector. Although the vector found by LDA works well
in  most  cases,  several  drawbacks  might  occur  in  practice.  First, 
when  we  apply  LDA  to  extract  the  discriminant  feature  vector, 
the  mathematical  procedure  automatically  combines  the  feature 
extractor  and  the  classifier  in  a  linear  form.  By  restricting  the 
form  or  criterion  of  the  mapping,  we  implicitly  assume  an  over- 
simplistic  model  of  the  pattern  recognition  system.  Such  a  situ¬ 
ation  will  arise  if  the  classes  are  not  linearly  separable  and  we 
restrict  the  feature  extractor  to  a  linear  form. 
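
The contrast between the K-L transform and LDA drawn around Fig. 7 can be reproduced with synthetic data. The following sketch is a hedged illustration: the class means, the covariance, the sample sizes, and the equal priors are all assumed values chosen to mimic the figure, not data from this study. It builds $S_w$ and $S_b$ as in (3.5) and (3.6) and compares the leading K-L axis with the leading eigenvector of $S_w^{-1}S_b$.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.diag([25.0, 0.25])               # large spread along axis 1 (assumed)
x1 = rng.multivariate_normal([0.0, 2.0], cov, size=500)
x2 = rng.multivariate_normal([0.0, -2.0], cov, size=500)
classes = [x1, x2]
priors = [0.5, 0.5]                       # equal a priori probabilities (assumed)

means = [x.mean(axis=0) for x in classes]
m = sum(p * mc for p, mc in zip(priors, means))                 # eq. (3.7)
Sw = sum(p * np.cov(x, rowvar=False, bias=True)                 # eqs. (3.3), (3.5)
         for p, x in zip(priors, classes))
Sb = sum(p * np.outer(mc - m, mc - m)                           # eq. (3.6)
         for p, mc in zip(priors, means))

# K-L transform: leading eigenvector of the covariance of the pooled data.
total_cov = np.cov(np.vstack(classes), rowvar=False, bias=True)
kl_vals, kl_vecs = np.linalg.eigh(total_cov)
kl_axis = kl_vecs[:, -1]                  # eigh sorts eigenvalues ascending

# LDA: leading eigenvector of Sw^{-1} Sb.
lda_vals, lda_vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
lda_axis = np.real(lda_vecs[:, np.argmax(np.real(lda_vals))])
lda_axis = lda_axis / np.linalg.norm(lda_axis)

print("K-L axis:", np.round(kl_axis, 3))
print("LDA axis:", np.round(lda_axis, 3))
```

With these assumed parameters the K-L axis aligns with the high-variance direction (axis 1), whereas the LDA axis aligns with the direction that separates the class means (axis 2).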

Moreover, since LDA involves the computation of the inverse
of the covariance matrices, it may lead to numerical problems,
especially when the matrices are estimated based on a
limited data set. In our application on the classification of vibration
signal data collected from multisensors, we might have
thousands of time-frequency feature components, while only
hundreds of training samples are available. For example, given a
256-point signal, full decomposition of the signal to the seventh
level and use of node energy as the feature component will result
in a 254-dimensional feature vector. Combining all feature vectors
from multiple sensors, say eight, will result in a 2032-dimensional
vector. However, only a few eigenvalues, such as ten,
are dominant, i.e.,

$$\lambda_1 + \lambda_2 + \cdots + \lambda_{2032} \approx \lambda_1 + \cdots + \lambda_{10}. \qquad (3.9)$$

This means that, in a practical sense, we are handling $S_w$ with
rank ten, even though the mathematical rank of $S_w$ is still 2032,
i.e., $\lambda_i \neq 0, \forall i$. In the calculation of $S_w^{-1}$, the determinant of
$S_w$ is $\prod_{i=1}^{2032}\lambda_i$, and $\lambda_i$, $i = 11, \ldots, 2032$, are very close to zero.
Suppose $\lambda_1 + \cdots + \lambda_{10} = 0.9$ out of $\sum_{i=1}^{2032}\lambda_i = 1$; then

$$\prod_{i=1}^{10}\lambda_i \prod_{j=11}^{2032}\lambda_j = \prod_{i=1}^{10}\lambda_i \times (0.1/2022)^{2022} \approx 0 \qquad (3.10)$$

for the assumption $\lambda_{11} = \lambda_{12} = \cdots = \lambda_{2032} = 0.1/2022$.

This, indeed, leads to some numerical instability in handling
such a near-singular matrix. For this reason, we resort to
employing the feature selection in feature measurement space, as
described in the following section, which avoids the numerical
problems of calculating the inverse of covariance matrices that
LDA encounters. Instead of trying to find a linear transformation to
reduce the dimensionality, we evaluate the discriminant power
of each individual feature component and discard those feature
components containing little class separability information
as measured by a selected criterion. Then, a neural network is
employed as a classifier to deal with nonlinearly separable
cases in the feature space.
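
The size of the tail factor in (3.10) can be checked with a few lines of arithmetic. This sketch only evaluates the numbers already stated in the text (2032 features, ten dominant eigenvalues, and a remaining mass of 0.1 shared equally); the variable names are illustrative.

```python
import math

n_total, n_dominant = 2032, 10
n_tail = n_total - n_dominant             # 2022 small eigenvalues
tail_value = 0.1 / n_tail                 # each small eigenvalue in (3.10)

# Direct evaluation underflows double precision and returns exactly zero.
tail_product = tail_value ** n_tail
# Working in logarithms shows how small the factor really is.
log10_tail = n_tail * math.log10(tail_value)

print(tail_product)
print(f"log10 of the tail factor is about {log10_tail:.0f}")
```

The direct product is indistinguishable from zero in floating point, which is the practical meaning of the determinant being approximately zero and the reason inverting such an $S_w$ is unreliable.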

The idea behind feature selection in feature measurement
space is to select the feature components that contain discriminant
information and discard those feature components that
provide little information useful for classification purposes
[21]. Specifically, the feature components $\{f_k \mid k = 1, 2, \ldots, n\}$
are ranked as

$$J(f_1) \ge J(f_2) \ge \cdots \ge J(f_d) \ge \cdots \ge J(f_n) \qquad (3.11)$$

where $J(\cdot)$ is a criterion function for measuring the discriminant
power of a specific feature component. The feature subset can
be selected from the available features that have larger criterion
function values.

To obtain a clearer picture of measuring the discriminant power of a feature, it is essential to introduce the concept of probabilistic structure of classes. Consider the probability density functions of class c1 and c2 given in Fig. 8. For a specific feature variable x, if p(x | c1) is zero for all x such that p(x | c2) ≠ 0, as illustrated in Fig. 8(a), then these two classes can be fully separable. On the other hand, when p(x | c1) = p(x | c2), as in Fig. 8(b), it is impossible to distinguish elements of class c1 from those belonging to c2. Intuitively, a criterion function for evaluating the discriminant power of a feature could be assessed by measuring the overlap between p(x | c1) and p(x | c2). A high overlap corresponds to a low discriminant power and vice versa.


Fig. 8. Probability density functions of (a) two well-separated classes and (b) two completely overlapping classes.


TABLE I
Westland Helicopter Gearbox Data Description

Fault Type Number   Description
1                   No Defect
2                   Planetary Bearing Corrosion
3                   Input Pinion Bearing Corrosion
4                   Spiral Bevel Input Pinion Spalling
5                   Helical Input Pinion Chipping
6                   Helical Idler Gear Crack Propagation
7                   Collector Gear Crack Propagation
8                   Quill Shaft Crack Propagation

In general, a criterion function for measuring the overlap between classes has the following properties [21].

1) The measure is minimum when the conditional probability density functions for class c1 and c2 are identical, i.e.,

$$J(\cdot) = 0, \quad \text{if } p(x \mid c1) = p(x \mid c2). \qquad (3.12)$$

2) The measure is nonnegative.

3) The measure attains a maximum when the classes are disjoint, i.e.,

$$J(\cdot) = \max, \quad \text{if } p(x \mid c1) = 0 \text{ when } p(x \mid c2) \neq 0, \; \forall x. \qquad (3.13)$$

TABLE II
Dimension of Final Feature Vector Using One Sensor

                          Sensor 1  Sensor 2  Sensor 3  Sensor 4  Sensor 5  Sensor 6  Sensor 7  Sensor 8
Wavelet Packet Feature
  PWM                        32        33        42        31        38        47        39        50
  KNK                        34        36        46        45        34        34        50        33
Fourier Feature
  PWM                        31        37        [?]       31        43        51        37        54
  KNK                        28        30        [?]       35        31        35        37        35

Entries marked [?] are illegible in the source.

TABLE III
Classification Performance (Sensors 1 and 2)

                        PWM                     KNK
                 Tr. Err.  Test Err.     Tr. Err.  Test Err.
Sensor 1  WPT     3.25      21.92         2.75      24.50
          FT      0.75       4.25         1.25       4.58
Sensor 2  WPT     1          4            1.00       4.33
          FT      0          2.17         0.25       1.92

TABLE IV
Classification Performance (Sensors 3 and 4)

                        PWM                     KNK
                 Tr. Err.  Test Err.     Tr. Err.  Test Err.
Sensor 3  WPT     0.75       6.75         0.75       7.25
          FT      0.25       1.75         0.50       1.42
Sensor 4  WPT     2.25       6.42         1.25       8.00
          FT      1.25       5.83         1.75       5.08

TABLE VI
Classification Performance (Sensors 7 and 8)

                        PWM                     KNK
                 Tr. Err.  Test Err.     Tr. Err.  Test Err.
Sensor 7  WPT     1          2.25         1.50       5.08
          FT      1.5        2.42         0.00       3.08
Sensor 8  WPT     0.5       62            3.25      61.08
          FT      0         46.83         1.75      45.17

Although the above properties provide an intuitive justification of their suitability for feature selection, their relative potential can be assessed only if their relationship to the classification error is known. Nevertheless, these measures are closely related to the error probability. This relationship is a consequence of the fact that the measures give a direct [...]

An efficient criterion function is known as Fisher's criterion [_0]. In a two-class problem it is given by

$$J_{f_k}(i, j) = \frac{|\mu_{i,f_k} - \mu_{j,f_k}|^2}{\sigma_{i,f_k}^2 + \sigma_{j,f_k}^2} \qquad (3.14)$$

where $\mu_{i,f_k}$, $\mu_{j,f_k}$ and $\sigma_{i,f_k}^2$, $\sigma_{j,f_k}^2$ are the mean and variance of

the kth feature, for class i and j correspondingly. When there are more than two classes of data, the general approach is to take the summation of the pairwise combinations of

$$J_{f_k} = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} J_{f_k}(i, j) \qquad (3.15)$$

as an estimation of discriminant power for the specific feature $f_k$. Here, L represents the number of classes in the problem. Equation (3.15) provides us with a measure to evaluate the effectiveness of the "global" feature that is simultaneously suitable to differentiate all classes of signals. For a small number of classes, this approach may be sufficient. The more signal classes, the more ambiguous (3.15) becomes. A large value for (3.15) may be due to a few significant terms with a negligible majority (a favorable case) or to the accumulation of many terms with relatively small values (an unfavorable case). A feature that can effectively differentiate a pair of signal classes, i.e., with a large discriminant measure, as calculated by (3.15), might be averaged out during the pairwise summation. Note the feature selection is performed to ease the neural network classifier design (i.e., in the training process). To avoid such a problem, we propose two feasible approaches as described below.

1) Approach I (PWM): Instead of trying to identify features that are effective for the entire multiclass problem globally as measured by (3.15), we select a feature subset based on (3.14) for each possible pair of classes. Then, we take the union of feature components selected from each pair of classes to form the final feature vector. Specifically, given an L-class problem with n feature components, the selection process is detailed in the following.

1) For each possible class pair $\{(i, j) \mid i = 1, 2, \ldots, L-1;\; j = i+1, i+2, \ldots, L\}$, calculate the discriminant power measure for each feature component, i.e.,

$$J_{f_k}(i, j) = \frac{|\mu_{i,f_k} - \mu_{j,f_k}|^2}{\sigma_{i,f_k}^2 + \sigma_{j,f_k}^2}. \qquad (3.16)$$

2) For each class pair (i, j), sort $J_{f_k}(i, j)$ such that

$$J_{f_1}(i, j) \geq J_{f_2}(i, j) \geq \cdots \geq J_{f_d}(i, j) \geq \cdots \geq J_{f_n}(i, j). \qquad (3.17)$$

Determine the feature subset $F_{i,j}$ for each class pair by selecting d feature components that have maximum $J_{f_k}(i, j)$ value

$$F_{i,j} = \{f_k \mid k = 1, 2, \ldots, d\}, \quad i = 1, 2, \ldots, L-1;\; j = i+1, i+2, \ldots, L. \qquad (3.18)$$

3) Form the final feature set by taking the union of each feature subset

$$F_{\text{final}} = \bigcup_{i,j} F_{i,j}. \qquad (3.19)$$

2) Approach II (KNK): Another approach to avoid the influence of the pairwise summation process is similarly suggested by Watanabe [22]. Given an L-class signal classification problem, we can consider the class $\bar{q}$ signals as the conceptual opposite of the class q signals, which is the ensemble of data belonging to classes other than the q class. Then, we apply Fisher's criterion as was done in the two-class problems to evaluate the discriminant power of each individual feature component.

1) For each class q = 1, 2, ..., L, we partition the data set to class q signals and class $\bar{q}$ signals. In this way, we can get L sets of data that can be used for selecting features.

2) For each of the L sets, use Fisher's criterion to evaluate the discriminant power for each feature component

$$J_{f_k}(q, \bar{q}) = \frac{|\mu_{q,f_k} - \mu_{\bar{q},f_k}|^2}{\sigma_{q,f_k}^2 + \sigma_{\bar{q},f_k}^2}. \qquad (3.20)$$

3) For each of the L sets, sort $J_{f_k}(q, \bar{q})$ such that

$$J_{f_1}(q, \bar{q}) \geq J_{f_2}(q, \bar{q}) \geq \cdots \geq J_{f_d}(q, \bar{q}) \geq \cdots \geq J_{f_n}(q, \bar{q}). \qquad (3.21)$$

TABLE VII
Dimension of Final Feature Vector Using Eight Sensors

                               d=1   d=2   d=3   d=4   d=5   d=6   d=7   d=8
Wavelet Packet Feature  PWM      7    14    26    35    42    50    58   6[?]
                        KNK      8    16    24    32    39    47    55    63
Fourier Feature         PWM      9    16    24    27    29    30    32    38
                        KNK      8    16    24    32    40    48   5[?]  5[?]

Digits marked [?] are illegible in the source.

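TABLE VIII
Classification Performance (Eight-Sensor Data; PWM)

                  d=1    d=2    d=7    d=8
WPT  Tr. Err.     0.5    0      0      0
     Test Err.    0.92   0.17   0.08   0.25
FT   Tr. Err.     0      0      0      0
     Test Err.    0      0      0      0

Of the four columns d=3 to d=6, only two are legible in the source, and their column positions cannot be recovered: (0, 0.17, 0, 0.25) and (0, 0.25, 0, 0), in the row order above.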

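TABLE IX
Classification Performance (Eight-Sensor Data; KNK)

                  d=3    d=4    d=5    d=6    d=7
WPT  Tr. Err.     0      0      0      0      0
     Test Err.    0      0      0      0.08   0.17
FT   Tr. Err.     0      0      0      0      0
     Test Err.    0      0.08   0      0      0

Entries for d=1 and d=2 (row assignment not recoverable from the source): 0, 0.08, 0.25, 0; Tr. Err. 2.75, 0; Test Err. 2.17, 0.33.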


Determine the feature subset $F_{q,\bar{q}}$ for each of the L sets by selecting d feature components that have maximum $J_{f_k}(q, \bar{q})$ value

$$F_{q,\bar{q}} = \{f_k \mid k = 1, 2, \ldots, d\}, \quad q = 1, 2, \ldots, L. \qquad (3.22)$$

4) Form the final feature set by taking the union of each feature subset

$$F_{\text{final}} = \bigcup_{q=1}^{L} F_{q,\bar{q}}. \qquad (3.23)$$

Suitable feature components, which offer favorable class separability measure, are found as described. Many classifiers can then be designed based on these features. A feedforward neural network is employed in this study because of its capability in dealing with nonlinearly separable distributions.
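The two selection procedures can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes Fisher's criterion takes the usual two-class form (squared mean gap over summed variances), uses population variances, and all function names and the toy data are hypothetical.

```python
from itertools import combinations
from statistics import fmean, pvariance

def fisher(a, b):
    """Assumed Fisher's criterion for one feature: |mean gap|^2 / (var_a + var_b)."""
    denom = pvariance(a) + pvariance(b)
    return (fmean(a) - fmean(b)) ** 2 / denom if denom > 0 else float("inf")

def column(samples, k):
    """Values of feature k over a list of sample vectors."""
    return [row[k] for row in samples]

def top_d(scores, d):
    """Indices of the d features with the largest criterion values."""
    return set(sorted(range(len(scores)), key=lambda k: -scores[k])[:d])

def select_pwm(data, d):
    """Approach I (PWM): union of the top-d features of every class pair."""
    n = len(next(iter(data.values()))[0])
    chosen = set()
    for i, j in combinations(sorted(data), 2):
        scores = [fisher(column(data[i], k), column(data[j], k)) for k in range(n)]
        chosen |= top_d(scores, d)
    return sorted(chosen)

def select_knk(data, d):
    """Approach II (KNK): union of the top-d features of each class versus the rest."""
    n = len(next(iter(data.values()))[0])
    chosen = set()
    for q in sorted(data):
        rest = [row for c, rows in data.items() if c != q for row in rows]
        scores = [fisher(column(data[q], k), column(rest, k)) for k in range(n)]
        chosen |= top_d(scores, d)
    return sorted(chosen)

# Toy data: 3 classes, 4 features. Feature 0 separates class 2, feature 2
# separates class 3, features 1 and 3 carry no class information.
data = {
    1: [[0.0, 5.0, 1.0, 0.2], [0.2, 5.1, 1.1, 0.1], [0.1, 4.9, 0.9, 0.3]],
    2: [[3.0, 5.0, 1.0, 0.1], [3.1, 5.2, 1.2, 0.3], [2.9, 4.8, 0.8, 0.2]],
    3: [[0.1, 5.0, 4.0, 0.2], [0.0, 5.1, 4.1, 0.3], [0.2, 4.9, 3.9, 0.1]],
}
pwm_features = select_pwm(data, d=1)
knk_features = select_knk(data, d=1)
```

With L classes, PWM scores L(L-1)/2 class pairs and KNK scores L one-versus-rest splits; in both cases the size of the final feature set is at most d times the number of splits, because the union removes repeats.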

D.  Neural  Networks  Classifier 

Once  suitable  features  have  been  extracted  and  selected  from 
the  vibration  data  as  discussed  above,  it  is  necessary  to  deter¬ 
mine  the  failure  mode  based  upon  these  features.  Ideally,  the 
features  for  normal  and  faulty  conditions  will  occupy  nonover¬ 
lapping  areas  in  the  feature  space.  If  not,  then  the  classification 
algorithm will have to approximate a Bayes classifier [23].

Consider an L-class problem. The probability that a particular pattern x comes from class $c_i$, $i = 1, 2, \ldots, L$ is denoted as $p(c_i \mid x)$. If the pattern classifier decides that x came from $c_j$ when it actually came from $c_i$, it incurs a loss, denoted $l(i \mid j)$.


Table IX (continued), d=8: WPT Tr. Err. 0, Test Err. 0.08; FT Tr. Err. 0, Test Err. 0.


As pattern x may belong to any one of L classes under consideration, the average loss incurred in assigning x to class $c_j$ is

$$r_j(x) = \sum_{k=1}^{L} l(k \mid j)\, p(c_k \mid x). \qquad (3.24)$$

In general, the loss for a correct decision is zero (i.e., k = j) and it has the same nonzero value (e.g., 1) for any incorrect decision, i.e.,

$$l(k \mid j) = 1 - \delta_{kj} \qquad (3.25)$$

where

$$\delta_{kj} = \begin{cases} 1, & k = j \\ 0, & k \neq j. \end{cases} \qquad (3.26)$$

Then, the loss of assigning a pattern x to class $c_j$ becomes

$$r_j(x) = \sum_{k=1}^{L} (1 - \delta_{kj})\, p(c_k \mid x). \qquad (3.27)$$

The classifier has L possible classes to choose from for any given unknown pattern x. If it computes $r_j(x)$, $j = 1, 2, \ldots, L$ for each pattern x, and assigns the pattern to the class with smallest loss, then total average loss with respect to all decisions will be minimum. The classifier that minimizes (3.27) is called the Bayes classifier. Thus, the Bayes classifier assigns an unknown pattern vector x to class $c_i$ if

$$r_i(x) < r_j(x), \quad j = 1, 2, \ldots, L;\; j \neq i. \qquad (3.28)$$

Fig. 10. White Gaussian noise and its power spectrum. (a) White noise. (b) PSD of white Gaussian noise.

Substituting (3.27) into (3.28), the decision rule is then to choose label $c_i$ if

$$\sum_{k \neq i} p(c_k \mid x) < \sum_{k \neq j} p(c_k \mid x), \quad j = 1, 2, \ldots, L;\; j \neq i. \qquad (3.29)$$

Note that each side of (3.29) is the sum of all posteriors with exactly one term missing. The decision rule then becomes to assign x to $c_i$ if, for all $j = 1, 2, \ldots, L$; $j \neq i$,

$$p(c_i \mid x) > p(c_j \mid x). \qquad (3.30)$$
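The equivalence between minimizing the 0-1 expected loss and picking the largest posterior can be checked with a short sketch. The function names and the example posterior vector are hypothetical; classes are indexed from 0 for convenience.

```python
def expected_loss(posteriors, j):
    """r_j(x) under 0-1 loss: the sum of all posteriors except the j-th."""
    return sum(p for k, p in enumerate(posteriors) if k != j)

def bayes_decide(posteriors):
    """Index of the class with the smallest expected loss."""
    losses = [expected_loss(posteriors, j) for j in range(len(posteriors))]
    return min(range(len(losses)), key=losses.__getitem__)

def map_decide(posteriors):
    """Index of the class with the largest posterior."""
    return max(range(len(posteriors)), key=posteriors.__getitem__)

posteriors = [0.1, 0.7, 0.2]  # hypothetical p(c_k | x) for a 3-class problem
decision = bayes_decide(posteriors)
```

Because the posteriors sum to one, each expected loss equals one minus the corresponding posterior, so the two decision functions agree whenever the maximum is unique.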

For the decision rule based on (3.27) to hold, the posteriori density functions $p(c_i \mid x)$, $i = 1, 2, \ldots, L$ must be known. In practice, they must be estimated from the available data set. To obtain the estimates of the posteriori density functions, neural networks are applied in the study for the following reasons. First, neural networks are universal approximators in the sense that they can theoretically approximate any continuous input-output mapping to any desired degree of accuracy. Hence, they can be used to approximate the posteriori function $p(c_i \mid x)$. Secondly, neural networks are inherently nonlinear in the activation function; they have the ability to capture the underlying nonlinearity for the generation of incoming data.

IV.  Simulation  Results 
A.  Data  Description 

In this section, the feasibility of the wavelet-packet-based feature classification technique was examined through numerical simulations on a real data set known as the Westland data set. The Westland data set [24] was chosen because it has been analyzed by a number of other researchers and because it is considered a benchmark data set in the field. The vibration data used for simulation is archived at the Applied Research Laboratory, Pennsylvania State University, University Park. In this data set, vibration data was recorded from an aft main power transmission of a U.S. Navy CH-46E helicopter. The vibration data were collected using eight accelerometers mounted at the known fault sensitive locations of the helicopter gearbox. The data were then recorded for various seeded faults including the no-defect case, listed in Table I. Nine torque levels, ranging from 27% up to 100%, and various fault severity levels were applied. One tachometer was placed on the aft transmission in place of the rotor position motor. The tach signal is a 256 pulse-per-revolution signal with a once-per-revolution signal superimposed on it. Based on its position in the gearbox, "one revolution" describes a complete rotation of the rotor position output, not that of the main shaft. The vibration data was sampled at a 103 116.08-Hz rate. With the approximate 100-kHz sampling rate, there are between 897 and 904 samples within the period defined by the tachometer signal.

B.  Signal  Segmentation 

For utilization of the fast WPT algorithm, each block of 1024 consecutive time-series data points of the vibration signal is defined as a sample vector to be analyzed. The reason for using 1024 points is that it covers one period defined by the tachometer signal. It is reasonable to assume that fault symptoms can be fully described within the period. Fig. 9 shows two signal segments and corresponding power spectra for normal mode and fault 3. Looking at the spectrum of the vibration data segment, we observe a long flat region toward the end of the frequency range; it is, thus, inferred that the bandwidth of the signal is much less than the sampling frequency. Based on this observation, the sample vector is first downsampled by four to yield a 256-point signal segment. This lowers the computational complexity without losing much information of the original signal.

Additionally, if a random signal has a nonzero mean, its power spectrum has an impulse at zero frequency. If the mean is relatively large, this component will dominate the spectrum estimate, causing low-amplitude low-frequency components to be obscured by the leakage. Therefore, in practice, the mean is often estimated, and the resulting estimate is subtracted from the random signal before computing the power spectrum estimate. Although the sample mean is only an approximate estimate of the zero-frequency component, subtracting it from the signal often leads to a better estimate at neighboring frequencies [25].
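The preprocessing described above (segmentation, downsampling by four, mean removal) might look as follows. This is a sketch under stated assumptions: the downsampling is plain decimation with no anti-alias filter, which the text does not specify, and the function names and the synthetic signal are hypothetical.

```python
import math

def segment(signal, length=1024):
    """Split into non-overlapping segments, dropping the incomplete tail."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, length)]

def downsample(seg, factor=4):
    """Keep every factor-th sample (assumption: no anti-alias filtering)."""
    return seg[::factor]

def remove_mean(seg):
    """Subtract the sample mean so the zero-frequency impulse is suppressed."""
    mean = sum(seg) / len(seg)
    return [v - mean for v in seg]

def preprocess(signal):
    """1024-point segments -> 256-point, zero-mean sample vectors."""
    return [remove_mean(downsample(seg)) for seg in segment(signal)]

# Synthetic stand-in for one data file of 412 464 points with a DC offset.
signal = [math.sin(0.01 * n) + 2.0 for n in range(412464)]
vectors = preprocess(signal)
```

A file of 412 464 points yields 402 complete 1024-point segments under this non-overlapping scheme.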

TABLE X
Classification Performance (White Noise; SNR = 0 dB; PWM)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.25   0      0      0      0      0      0      0
     Test Err.    0.5    0.08   0.17   0.08   0.08   0.33   0.33   0.42
FT   Tr. Err.    14.25   1.25   1      0.25   0.5    0      0      0
     Test Err.   17.83   2.25   [?]    2.83   2.08   [?]    [?]    [?]

TABLE XI
Classification Performance (White Noise; SNR = 0 dB; KNK)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0      0.25   0.25   0      0      0      0      0
     Test Err.    0.25   0.5    1.08   0.08   0.17   0.17   0.25   0.42
FT   Tr. Err.     1.75   1      0.25   1      0.5    0      0      [?]
     Test Err.    6.33   6.08   3.42   2.08   4      2.25   2.17   1.5

TABLE XII
Classification Performance (White Noise; SNR = -3 dB; PWM)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.25   0      0.25   0.25   0      0.25   0      0
     Test Err.    0.67   0.08   0.92   0.83   0      0.17   0.08   0.08
FT   Tr. Err.     5      1      1.25   1.25   0.75   1      0.25   0
     Test Err.    7.25   5.42   5.92   6.92   4.83   4.83   6.67   5

TABLE XIII
Classification Performance (White Noise; SNR = -3 dB; KNK)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.5    0      0      0      0.25   0      0.25   0.25
     Test Err.    1      0.17   0.17   0.33   0.75   0.08   1.17   1.33
FT   Tr. Err.    16.25   3.25   1.5    1.75   1      0.75   0.25   0
     Test Err.   19     10.58   5      6      4.5    5.08   4.83   4.17

Entries marked [?] are illegible in the source.

C. Generation of Training Data Set/Testing Data Set

There are a total of 68 data sets available, which correspond to nine different torque levels and eight-class conditions. For each [...] fault signals available. Each file contains 412 464 data points. The 412 464 data points are segmented to [...], containing 1024 data points each. In this study, the first 50 samples collected represent the training data set, while the remaining 350 samples are used as the testing data set. These data sets, unseen by the neural classifier, are kept to evaluate the generalization capability. In this study, only the data sets corresponding to torque level 100% were used for evaluation.


D.  System  Description 

In the following simulations, each vibration signal segment is transformed into a wavelet-packet-based energy vector as described in Section III-B. The two proposed feature selection methods are then employed to identify a subset of feature components that will be used as input to the neural network classifier. The steps are summarized as follows.

1) Seven-level WPD is generated for each vibration signal segment.

2) Energy-based feature measures, as discussed in Section III-B, are then computed for each WPD of signal segment from step 1. This results in a 254-dimensional feature vector.

3) Identify a subset of feature components, as discussed in Section III-C, to form an input vector for the neural network classifier.
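Steps 1 and 2 can be illustrated with a short sketch that shows where the figure of 254 comes from: a full seven-level packet tree has 2 + 4 + ... + 128 = 254 nodes. The Haar filter pair and the definition of node energy as the sum of squared coefficients are assumptions made for brevity; the wavelet actually used is not stated in this section, and the names are hypothetical.

```python
import math

def haar_split(x):
    """One Haar analysis step: (approximation, detail), each half the input length."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def wpt_node_energies(x, levels=7):
    """Energies of all nodes at levels 1..levels of the full wavelet packet tree."""
    energies = []
    nodes = [list(x)]
    for _ in range(levels):
        # Split every node of the current level, low-pass and high-pass alike.
        nodes = [child for node in nodes for child in haar_split(node)]
        energies.extend(sum(v * v for v in node) for node in nodes)
    return energies

x = [math.sin(0.3 * t) for t in range(256)]  # stand-in for a 256-point segment
e = wpt_node_energies(x)
```

Since the Haar pair is orthonormal, the node energies at each level sum to the energy of the segment, which gives a quick sanity check on any implementation.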


E. Test Result Using One-Sensor Data

In the following simulations, we conducted tests on features extracted from both Fourier-based and wavelet-packet-based analyses for assessing the applicability of wavelet-packet-based analysis as a tool for vibration monitoring. The Fourier-based features are defined as the power spectrum of a 256-point signal segment, and the result is a 129-dimensional vector where each component corresponds to one of the 129 uniform frequency band energies.
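The count of 129 follows from keeping the bins k = 0 to N/2 of a 256-point transform. The sketch below uses a naive discrete Fourier transform for self-containment; the normalization by N is an assumption, since the text does not give the exact power spectrum estimator, and the names are hypothetical.

```python
import cmath
import math

def power_spectrum(x):
    """One-sided power spectrum |X[k]|^2 / N for k = 0..N/2 (naive O(N^2) DFT)."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

# Test tone with exactly 10 cycles in the 256-point window.
x = [math.sin(2 * math.pi * 10 * t / 256) for t in range(256)]
features = power_spectrum(x)
peak_bin = features.index(max(features))
```

In practice a fast Fourier transform would replace the double loop; the naive form is used here only so the example has no dependencies.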

The two feature selection processes, Approaches I and II as described in Section III-C, are applied on both wavelet-packet-based feature components and Fourier-based feature components to select the best discriminant feature components. The obtained feature components are then used as input to train the neural network classifier. For each of the feature selection approaches, the eight highest discriminant (d = 8) feature components (out of 254) are used to form the final feature vector. d is chosen based upon LDA where only the eight highest eigenvalues are dominant in the given data set. Table II provides the dimension of the final feature vector for the two approaches. In the following, the feature selection method Approach I, as described in Section III-C, is designated as PWM, while the Approach II is designated as KNK. In general, the computation cost for PWM will be less than that of KNK.

The network architecture is D-D-8, where D is the dimension of the final feature vector. In the training process, the network is trained until the mean-square error is below 0.01, or the maximum number of epochs (10 000) is reached. In practice, the neural network will not produce a perfect decision, i.e., only one 1 in the output neuron while others are all 0s, and might produce values between zero and one. Hence, it was decided to use the maximum output value as the most likely fault condition (i.e., a hard decision). In all simulations, a clear winner can always be identified.

Fig. 11. Color noise and its power spectrum. (a) Color noise. (b) PSD of color noise.

The classification results are shown in Tables III-VI. Note that the unit of error is % in all classification results. "Tr. Err." refers to the training error, while "Test Err." refers to the testing error. Note that the performance of using different sensor data shows significant differences. For example, the testing errors of using data from sensors 5, 6, and 8 are relatively higher than those of other sensors for both the wavelet-packet-based and Fourier-based approaches. It is, thus, inferred that some sensors are not sensitive to the detection of specific fault symptoms. This suggests the need to use multiple-sensor data to search for class-specific features.

F. Test Result Using Eight-Sensor Data

In the following tests, feature components from all eight sensors are used to begin the feature selection process, i.e., the comparison of discriminant powers is conducted on features coming from all eight-sensor data.

Table VII provides the dimension of the final feature vector for the two approaches based on wavelet packet features and Fourier features, respectively. The number of features chosen is with respect to ranking order in the discriminant function, as discussed in Section III-C. All simulation settings, network architecture and MSE goal, are the same as in previous tests. The classification results are displayed in Tables VIII and IX corresponding to feature selection methods PWM and KNK, respectively. In the tables which follow, FT refers to the Fourier-based features while WPT refers to wavelet-packet-based features. The results show the performance is much improved when combining data from all sensors. If we use only one sensor, the crucial information for the specific fault symptom may not be detected and the overall classification performance may be lower. This confirms a general understanding that some fault symptoms can only be detected by some neighboring sensors. To qualitatively analyze the parameter d with respect to classification performance remains to be investigated in the near future.

Additionally, it is observed that the performance of the Fourier-based approach shows slightly better results than the wavelet-packet-based approach. It is concluded that features providing discriminant information may demonstrate narrow-band frequency characteristics in this data set. In such cases, the Fourier-based approach is ideally the better candidate for extracting signal features. Recall that there is a slight amount of frequency overlap among the wavelet basis functions; thus, a particular frequency may be sensed by two different basis functions. This frequency leakage may lead to worse performance using wavelet-packet-based features. Nevertheless, the WPT is still able to extract the essential discriminant features and achieve a satisfactory performance.

TABLE XIV
Classification Performance (Color Noise; SNR = 0 dB; PWM)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.    12.5    0      0.25   0      0      0      0      0
     Test Err.   13.08   0.5    0.25   0.08   0.5    0.25   0.67   1.05
FT   Tr. Err.    15.75   2.5    0.75   0.75   0.25   0.25   0      [?]
     Test Err.   18      5.92   9.5    6.83   4.75   [?]    [?]    4.33

In the FT Test Err. row, only six entries are legible in the source; the placement of the last value is uncertain.

TABLE XV
Classification Performance (Color Noise; SNR = 0 dB; KNK)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.5    0.75   0      0      0      0      0      0
     Test Err.    3      2.42   1.17   0.56   0.42   0.75   1      [?]
FT   Tr. Err.     2.25   1.5    0.75   0.75   0      0.75   0      0
     Test Err.    7.08   7.08   2.42   3.75   3.25   2.17   2.83   4.17

In the WPT Test Err. row, only seven of eight entries survive in the source; their column positions are uncertain.

TABLE XVI
Classification Performance (Color Noise; SNR = -3 dB; PWM)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.    12.5    0      0      0      0      0      0      0
     Test Err.   13.08   0.08   0.08   0.42   0.25   0.25   0.17   0.25
FT   Tr. Err.    27      4.5    2      1.25   1.25   0.5    0.25   0.5
     Test Err.   30.25  10.83  16.08  14.42  13.08  12.08  11.92  13.17

TABLE XVII
Classification Performance (Color Noise; SNR = -3 dB; KNK)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.25   0.25   0      0      0      0      0.25   0
     Test Err.    0.75   0.83   0.58   0.67   1.58   0.33   1.67   2.33
FT   Tr. Err.     7.25   3.25   2.25   1      0.5    1.75   0.25   1
     Test Err.   10      9.83   9.33  11.17  10     13.33  10.92  11.83

Entries marked [?] are illegible in the source.

G. Test on Data Corrupted by Additive White Noise

A measured vibration signal can be considered to have the following components: the fault response caused by faulty equipment; vibration from normal machine components; vibration of neighboring machinery; and measurement variation. In monitoring vibration signals, we considered the noise to consist of vibration from machine components (other than the faulty response), neighboring machinery, and measurement noise. The presence of noise complicates the monitoring tasks in two forms: by masking the signal of interest and by increasing the vibration values beyond monitoring criteria, when, in fact, the component being monitored experiences no sign of malfunction. Note the original Westland data were collected from a laboratory test stand. The data sets are very clean. To test further the feasibility of the wavelet-packet-based feature extraction technique in the presence of noise (a realistic environment), simulated data were artificially generated by adding different types of noise to the original vibration signals [9]. The goal was to investigate the robustness of wavelet-packet-based features when the data was subjected to the presence of noise. In the following simulations, the signals are first corrupted with artificially generated noise under different SNR, then the WPT and FT are applied on the corrupted signal to obtain the signal's time-frequency feature. Finally, the proposed feature selection method is used to identify discriminant feature components that will be used as input to train a neural network classifier. In this study we use three types of noise to model the vibration signal other than the signal being monitored.

The first type of noise model used is white Gaussian noise, where no frequency is dominating, as shown in Fig. 10. This noise has an AR(0) model

$$X_k = a_k \qquad (4.1)$$

where $a_k$ are normally distributed with zero mean and variance $\sigma_a^2$. In this study, we use the Matlab function randn() to generate $a_k$.

Tables X-XIII show the results under different SNR. The results reveal that the wavelet-packet-based approach demonstrates better results than the Fourier-based approach. It was also observed that the difference of performance between the wavelet packet approach and the Fourier-based approach is even higher when the noise power is increased.


H.  Test  on  Data  Corrupted  by  Additive  Color  Noise 

The second type of noise used to corrupt original data is colored noise where a group of frequencies is dominant, as shown in Fig. 11. Such a noise can be generally represented by an ARMA(n, n - 1) model [26]

$$X_k = \phi_1 X_{k-1} + \phi_2 X_{k-2} + \cdots + \phi_n X_{k-n} + a_k - \theta_1 a_{k-1} - \theta_2 a_{k-2} - \cdots - \theta_{n-1} a_{k-n+1}. \qquad (4.2)$$

Coefficients $\phi_i$ and $\theta_i$ determine the center frequency and bandwidth of the noise. The $a_k$ are normally distributed with zero mean and variance $\sigma_a^2$. In our tests, however, we generated such a noise by convolving a white-noise sequence with a bandpass filter. We generated the colored noise such that the dominant frequencies lie between the digital frequency band 0 to 0.25π. This is the band where the original signal contains most of its energy, as can be seen from Fig. 9. Tables XIV through XVII show the classification results of conducted simulations corresponding to different SNR. In all simulations, better results are obtained via the wavelet-packet-based approach.

TABLE XVIII
Classification Performance (Pink Noise; SNR = 0 dB; PWM)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.5    0      0      0      0      0      0      0
     Test Err.    0.5    0.08   0.67   0      1.33   0.08   0.33   0.08
FT   Tr. Err.     0.5    1.25   0      0.25   0      0      0      0
     Test Err.    1.33   2.58   2.25   1.5    2.5    0.92   0.42   0.5

TABLE XIX
Classification Performance (Pink Noise; SNR = 0 dB; KNK)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0      0.25   0.25   0      0.25   0      0      0
     Test Err.    0.83   0.92   0.17   0.42   0.25   0.58   0      0.17
FT   Tr. Err.     1.75   0.5    0.5    0.25   0      0.25   0.25   0
     Test Err.    3.58   4.58   1.92   2      1.91   0.67   1.42   0.83

TABLE XX
Classification Performance (Pink Noise; SNR = -3 dB; PWM)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.5    0.25   0      0      0      0.25   0.25   0.25
WPT  Test Err.    1.83   0.33   0.33   0.08   0.5    0.42   1.42   0.42
FT   Tr. Err.     2.5    1      0.75   0      0.5    0      0      0
FT   Test Err.    5.67   3.25   4.08   3.17   4.17   1.33   2.5    1.17

TABLE XXI
Classification Performance (Pink Noise; SNR = -3 dB; KNK)

                  d=1    d=2    d=3    d=4    d=5    d=6    d=7    d=8
WPT  Tr. Err.     0.75   0      0      0      0      0.25   0.25   0.25
WPT  Test Err.    1.25   0.17   1.25   0.5    0.17   0.75   0      0.17
FT   Tr. Err.     1.25   1.25   0.5    0.75   0.5    0      0      0
FT   Test Err.    7      5.25   3.5    4.17   3.08   3.42   3.17   2.83

I.  Test  on  Data  Corrupted  by  Additive  Pink  Noise 

The third type of noise employed is pink noise, where power decreases as frequency increases, as depicted in Fig. 12. It can be expressed by an AR(1) model

$$X_k = \phi_1 X_{k-1} + a_k, \qquad (4.3)$$

where the $a_k$ are normally distributed with zero mean and variance $\sigma_a^2$. In the test, $\phi_1$ is set to 0.95, and the resulting noise is displayed in Fig. 12. The test results are shown in Tables XVIII-XXI. Again, it is confirmed that the wavelet-packet-based approach produces better results.
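The AR(1) noise model and the corruption of a signal at a prescribed SNR can be sketched as below. This is a minimal illustration, not the code used for the reported tests. It assumes the SNR is defined as the ratio of average signal power to average noise power in dB, and the function names are hypothetical.

```python
import math
import random


def ar1_noise(length, phi=0.95, sigma_a=1.0, seed=0):
    """AR(1) sequence x[k] = phi * x[k-1] + a[k], with a[k] ~ N(0, sigma_a^2)."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(length):
        x = phi * x + rng.gauss(0.0, sigma_a)
        out.append(x)
    return out


def power(seq):
    """Average power of a sequence."""
    return sum(v * v for v in seq) / len(seq)


def add_noise_at_snr(signal, noise, snr_db):
    """Scale the noise so that 10*log10(P_signal / P_noise) equals snr_db.

    Returns the corrupted signal and the scaled noise.
    """
    scale = math.sqrt(power(signal) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled = [scale * n for n in noise]
    return [s + n for s, n in zip(signal, scaled)], scaled
```

With `phi = 0.95` the sequence is strongly correlated from sample to sample, so most of its power lies at low frequencies. A value of `snr_db = -3` means the noise carries roughly twice the power of the signal.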

J.  Discussion  of  Test  Results 

By examining Tables III and IV, where only one sensor is used for the searching of class-specific feature components, it is clear that some sensors provide little class separability information in the sense of frequency analysis. This confirms our understanding that the faulted symptom is often localized and can only be detected by neighboring sensors. It suggests that data collected from multiple sensors will provide better classification information and lead to better performance. From the results of simulation on the original data set, it is observed that no improvement is made through the wavelet-packet-based approach on this data set and that, in several cases, it is even slightly worse than the Fourier-based approach. As mentioned before, this could be due to the "overlap" of frequency content among wavelet packet basis functions. Nevertheless, the wavelet-packet-based approach shows very promising results in a realistic environment in which the data are corrupted by noise.

V. Conclusion

This paper has investigated the feasibility of applying the WPT to the classification of vibration signals. Using the WPT, a rich collection of time-frequency characteristics in a signal can be obtained and examined for classification purposes. In this paper, we detailed an innovative feature selection process that exploits signal class differences in the wavelet packet node energy. This results in a reduced-dimensional feature space compared to the dimension of the original time-series signal. The wavelet-packet-based features, obtained by our method for vibration signals, yield nearly 100% correct classification when used as input to a neural network classifier.

Fig. 12. Pink noise and its power spectrum. (a) Pink noise. (b) PSD of pink noise.

In Section II, we reviewed the Fourier-based analysis for the extraction of frequency information from a signal and discussed the possible inherent drawbacks due to its fixed time-frequency resolution. The WPT, which overcomes the fixed time-frequency resolution, was then presented. To alleviate the time-variant characteristics of the WPT coefficients, wavelet packet node energy was used as an essential time-frequency feature measure of the signal. Although the wavelet packet node energy provided us with a multiresolution view of a signal, it simultaneously introduced a higher dimension space compared to the original time-domain signal. To reduce the dimensionality, it was shown that LDA had some practical problems when the feature dimension was relatively high compared to the number of collected samples, since it involved calculation of the inverse of the covariance matrix. In such a case, two feature selection criteria based on measures of the overlap of the conditional probability density function among different classes were proposed to avoid the possible numerical problems, as presented in Section III. In Section IV, the proposed wavelet-packet-based classification system, which combined a wavelet-packet-based feature extractor and a neural network classifier, was tested on a real data set known as the Westland data set. Numerically, it was observed that significant improvement can be achieved when using multiple sensor data. This validated our understanding that a faulted symptom is localized and can only be detected by the neighboring sensors. Both the Fourier-based features and wavelet-packet-based features achieved excellent classification results on the original Westland data set when all eight sensors were utilized. Nonetheless, the improved time-frequency resolution of the WPT is significant when we are confronted with signals corrupted by artificially synthesized noises. In the extended tests, the wavelet-packet-based approach showed very promising results compared to the Fourier-based approach.
Abstract

The validation of sensor measurements has become an integral part of the operation and control of modern industrial equipment. A sensor operating in a harsh environment must be shown to consistently provide correct measurements. Analysis by the validation hardware or software should trigger an alarm when the sensor signals deviate appreciably from the correct values. Neural network based models can be used to estimate critical sensor values on-line when neighboring sensor measurements are used as inputs. The underlying assumption is that the neighboring sensors share an analytical relationship. The discrepancy between the measured and predicted sensor values may then be used as an indicator of sensor health. The proposed Winner Take All Experts (WTAE) network, based on a 'divide and conquer' strategy, significantly reduces the computational time required to train the neural network. It employs a growing fuzzy clustering algorithm to divide a complicated problem into a series of simpler sub-problems and assigns an expert to each of them locally. After the sensor approximation, the outputs from the estimator and the real sensor readings are compared both in the time domain and the frequency domain. Three fault indicators are used to provide analytical redundancy to detect the sensor failure. In the decision stage, the intersection of three fuzzy sets accomplishes a decision level fusion, which indicates the confidence level of the sensor health. Two data sets, the Spectra Quest Machinery Fault Simulator data set and the Westland vibration data set, were used in simulations to demonstrate the performance of the proposed WTAE network. The simulation results show that the proposed WTAE is competitive with or even superior to the existing approaches. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: Sensor validation; Neural network; Winner take all; Divide and conquer


1.  Introduction 

Sensor validation has become an essential part of the operation and control of modern industrial equipment, which relies completely on correct sensor measurements. On one hand, in order for a situation to be fully comprehended and controlled, reliable data acquisition and interpretation is of the utmost importance in the modern decision-making process. On the other hand, increasing functionality in control systems makes the task of sensor validation and diagnosis more complicated and costly [1]. Control systems for critical plants, whose operations must not be interrupted for safety reasons, are often configured with redundant sensors to provide fault tolerance and to ensure the required degree of safety. As a result, the redundant sensors once again increase the cost and weight of the whole system. To achieve substantial savings on hardware redundancy and, at the same time, to meet the requirements of reliable and accurate sensor measurements for modern control systems, an intelligent sensor validation scheme based upon analytical redundancy is proposed in this paper.

Neural  network  based  estimators  can  be  used  to 
observe  critical  sensor  values  when  multiple 
neighboring  sensor  measurements  are  used  as 
additional  inputs.  This  framework  is  based  upon 
the  assumption  that  physically  close  sensors  are 
analytically  related.  The  estimated  sensor  value 
may  also  be  used  as  synthetic  data  in  the  event  of  a 
sensor  failure,  and  is  often  called  a  soft  sensor  or 
inferential  sensor.  Application  of  fuzzy  set  theory 
for  the  decision  making  process  is  advantageous, 
because  the  soft  decision  boundaries  in  fuzzy  logic 
environments  result  in  flexible,  more  human-like 
decisions.  The  information  that  the  validation 
system  deals  with  is  often  fuzzy,  obscure  rather 
than  precise  in  nature,  and  fuzzy  set  theory  allows 
us  to  model  this  imprecision  appropriately  and 
later  permits  us  to  reason  in  a  linguistic  language 
[2]. The proposed sensor validation scheme, which synergistically integrates the neural network and fuzzy logic, provides a reliable health indicator for the critical sensor of interest.

For  the  completeness  of  the  presentation,  the 
remainder  of  this  paper  is  organized  as  follows.  In 
Section  2,  we  introduce  the  proposed  Winner  Take 
All  Experts  (WTAE)  network  architecture  and  the 
validation  algorithm  along  with  a  brief  literature 
review.  In  Section  3,  we  analyze  the  performance 
of  the  architecture  on  two  benchmark  problems. 
Finally,  Section  4  provides  the  conclusion  of  the 
article  and  some  pertinent  observations. 

2.  The  Winner  Take  All  Experts  network 

A sensor is declared faulty when it displays a non-permitted deviation from the characteristic properties of the object it is monitoring. In most cases, a sensor validation scheme uses a number of hardware sensors to provide redundant information on each important system parameter, from which a more reliable value is extracted. Although hardware redundancy is popular and solves many sensor validation problems, it possesses some obvious disadvantages [3].

•  The expense of the redundant sensors, and of installing and maintaining them for each important observation, greatly increases the cost of the system.

•  Even when redundant sensors are in place, all of them are sensitive to common failures and are likely to fail.

Technology from artificial intelligence helped to establish the Knowledge Based System (KBS), which diagnoses sensors by utilizing system operational knowledge to validate sensor values [4]. The knowledge-based system is efficient and helps to yield valuable clues to the fault symptoms and their locations. However, it demands an accurate model with a comprehensive understanding of the basic physics and knowledge of all the potential variables, which are often unavailable, if not impossible to obtain, in most cases.

The primary tracking algorithms used to estimate the nominal sensor measurement are the Kalman Filter and the Extended Kalman Filter. The fundamental assumptions of the Kalman Filter are that the linear plant equations are known and that the noise is zero-mean and white with known covariance. These constraints make it difficult for this approach to model nonlinear systems and leave it vulnerable to correlated noise. The Extended Kalman Filter tries to linearize the nonlinear functions, but it is sensitive to the accuracy of the initial conditions [5].

Neural  network  based  models  can  be  used  to 
estimate  critical  sensor  values  when  other  sensors’ 
measurements  are  used  as  inputs.  The  basic  pre¬ 
mise  behind  this  framework  is  that  the  sensed 
plant  variables  are  not  independent  of  each  other 
[6].  The  most  popular  neural  network  in  use  today 
is  the  Multi-Layer  Perceptron  (MLP)  network. 
The  feasibility  of  using  MLP  network  in  sensor 
validation  has  been  studied  since  1990  [7,8].  A 
modified  MLP  network  was  implemented  for 
hardware  based  on-line  learning  soft  sensor  in 
1998  [9]. 

A Radial Basis Function (RBF) neural network was investigated by Wheeler and Dhawn [10] for sensor signal tracking. The network used the k-means clustering algorithm for placement of the basis function centers. The results showed that the RBF network is good at tracking the sensor signal when it varies slowly, but a general RBF network could not accurately learn the data and became unstable when the dynamics became more complex.

Traditional  neural  networks  such  as  MLP  and 
RBF  networks  have  proved  successful  as  universal 
function  approximators  and  have  been  used  in 
various  problems,  but  the  training  algorithms  are 
typically  too  slow  for  solving  real-world  problems 
in  real  time.  In  addition,  when  the  problem 
becomes  complicated,  most  of  the  systems  could 
not  even  converge  to  a  local  minimum  in  a  rea¬ 
sonable  time  due  to  hardware  limitation  and  the 
inefficiency  of  the  learning  rule.  Motivated  by  such 
concerns,  a  number  of  researchers  have  investi¬ 
gated  methods  of  function  approximation  incor¬ 
porating  ideas  from  the  communities  of  statistics 
and  artificial  intelligence  [11].  The  general 
approach is to divide a complicated problem into several simpler sub-problems and assign a function approximator or 'expert' to each sub-problem locally [12].

If a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different 'expert' networks plus a fuzzy membership clustering network that decides which of the experts should be used for each training case [13]. A system of this kind can be used only when the division into subtasks is known prior to training (e.g. along the operating regimes or trajectories). The proposed WTAE network is a modular architecture that works on the principle of 'divide and conquer'. The model employs a growing fuzzy clustering method to divide the input space into overlapping operating regions on which 'experts' act, and a fuzzy membership clustering network to weight these experts to form an overall network output. The idea behind such a system is that the growing fuzzy clustering algorithm allocates a new case to one of the experts, and, if the output is incorrect, the weight adaptations are localized to this expert.


2.1. Network architecture

The proposed WTAE network architecture is shown in Fig. 1. Without loss of generality, each expert network is a typical single-hidden-layer MLP network. The network's input vector consists of the signals from each related sensor located near the critical sensor of interest. The outputs from the expert networks form an output vector $\mathbf{y} = [y_1, y_2, \ldots, y_M]$, where $M$ denotes the number of expert networks. The input vector is simultaneously fed into the fuzzy membership clustering network, which produces the membership levels $\mu_i(x) \in [0, 1]$, where $i = 1, \ldots, M$. The membership levels corresponding to the expert networks together form a membership vector $\mu = [\mu_1, \mu_2, \ldots, \mu_M]$.

The fuzzy membership clustering network is a single layer network with a Gaussian output nonlinearity. Let the $i$th center and the $i$th variance of the fuzzy membership level network be $C_i = [C_{i1}, C_{i2}, \ldots, C_{iN}]$ and $\sigma_i = [\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{iN}]$, respectively, where $N$ is the dimension of the input space. The corresponding output $\mu_i(x)$ after the Gaussian nonlinearity is given by:

$$\mu_i(x) = \prod_{j=1}^{N} \exp\left(-\frac{(x_j - C_{ij})^2}{2\sigma_{ij}^2}\right). \qquad (1)$$

The Gaussian function ensures that $\mu_i \in [0, 1]$ and that, if $x$ belongs to the $i$th cluster,

$$\|x - C_i\| < \|x - C_j\|, \qquad \mu_i > \mu_j \qquad (2)$$

for $j = 1, 2, \ldots, i-1, i+1, \ldots, M$, where $\|\cdot\|$ denotes the Euclidean norm. Thus, the $i$th expert's influence is localized to a region around $C_i$. A Winner Take All function is denoted by $a = \mathrm{compet}(\mu)$, which takes one input argument $\mu$ and returns an output row vector with 1 at the $i$th element having the maximum membership value, $i = \arg\max_j(\mu_j(x))$, and 0 elsewhere. The Winner Take All function selects the winning expert network output to form the overall estimated output by $y(x) = \mathbf{y} \cdot a^T$.
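The membership computation and the winner-take-all selection can be sketched as follows. This is a minimal illustration: it assumes the Gaussian membership takes the form of a product over dimensions of exp(-(x_j - C_ij)^2 / (2 sigma_ij^2)), and it represents each expert as a plain Python callable, which is a hypothetical stand-in for a trained MLP.

```python
import math


def membership(x, center, sigma):
    """Product of per-dimension Gaussians; the result always lies in [0, 1]."""
    return math.exp(-sum((xj - cj) ** 2 / (2.0 * sj ** 2)
                         for xj, cj, sj in zip(x, center, sigma)))


def compet(mu):
    """Winner Take All: 1 at the largest entry (lowest index on ties), 0 elsewhere."""
    winner = max(range(len(mu)), key=lambda i: mu[i])
    return [1 if i == winner else 0 for i in range(len(mu))]


def wtae_output(x, experts, centers, sigmas):
    """Evaluate only the winning expert and return (estimate, selection vector)."""
    mu = [membership(x, c, s) for c, s in zip(centers, sigmas)]
    a = compet(mu)
    winner = a.index(1)
    return experts[winner](x), a
```

Because the selection vector contains a single 1, the product of the expert output vector with the transposed selection vector reduces to the output of one expert, so the sketch skips evaluating the losing experts.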

Finally, the output from the estimator and the real sensor are compared both in the time domain and the frequency domain. Three fault indicators are used to provide the necessary redundancy to identify sensor failure. The first validation gate, the sensor value validation gate, compares the tracking output directly with the real sensor data. In the second validation gate, the residual of the two time series is investigated using the autocorrelation coefficient. The power spectrum density of the two signals is compared in the last validation gate. A decision level fusion of the three validation gates is accomplished by the intersection of the three fuzzy sets (to be further discussed in Section 2.3). The output from the decision stage shows the sensor health confidence level of the critical sensor, which serves as an indicator for the human operator to take the necessary actions to remedy or ameliorate the sensor fault.

Fig. 1. WTAE network architecture.

2.2. Training algorithm

The WTAE network is trained using the Growing Fuzzy Clustering algorithm, in which the training data are presented sequentially. The network continuously adds local experts that contribute to the final estimate in the off-line training step. A growth criterion is essential: if a new expert contributes little to the output, it not only increases the complexity of the network unnecessarily but also adds to the computational burden. This leads to the question of how the network growth must be regulated. Fig. 2 shows the training process used in the WTAE network, where the training data are received sequentially. The network begins with no expert network. The first observation $(x^1, t^1)$, where $x^1$ is the input and $t^1$ is the corresponding target output, is used to initialize the first fuzzy membership cluster, characterized by $(C_1, \sigma_1)$, by setting $C_1 = x^1$ and $\sigma_1$ to a unity vector. The degree of belonging of input $x^1$ to the first cluster is measured by

$$\mu_1(x^1) = \prod_{j=1}^{N} \exp\left(-\frac{(x_j^1 - C_{1j})^2}{2\sigma_{1j}^2}\right). \qquad (3)$$

At the same time, the first sample is stored locally in the first training set corresponding to the first fuzzy cluster defined above. Next, a simple-architecture MLP expert network is built up and trained by the first training set. The structure of the MLP is determined by the complexity of the problem. The training goal, which decides the stopping criterion for the MLP training, is determined by the mean square error between the estimator and the real sensor output (target) that we can tolerate.

Fig. 2. WTAE network training algorithm.

As observations are continuously received, the network grows by adding new expert networks, if necessary. The decision to add a new expert for an observation $(x^k, t^k)$ depends on its novelty, for which the following two criteria must be met:

$$\|x^k - C^*\| > \varepsilon_1, \qquad (4a)$$

$$e^k = \|t^k - y^k\| \geq \varepsilon_2, \qquad (4b)$$

where $C^*$ is the nearest fuzzy cluster to $x^k$ in the input space, $y^k$ is the resulting output from the WTAE network, and $\varepsilon_1$, $\varepsilon_2$ are user-specified thresholds. The first criterion requires that the input be far away from the existing fuzzy clusters, and the second criterion requires that the error between the network output and the target value be significant. The threshold $\varepsilon_1$ represents the scale of resolution in the input space. On the other hand, the value $\varepsilon_2$ is chosen to represent the desired accuracy of the network output [14]. In practice, the trending approach should be considered to maintain the stability [15].
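The two novelty criteria can be sketched as a single predicate. This is an illustration only: it assumes a scalar target (one critical sensor), a strict inequality on the distance and a non-strict one on the error, and hypothetical argument names.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_novel(x, target, prediction, centers, eps1, eps2):
    """True when both growth criteria hold for the observation (x, target).

    eps1: minimum distance from the nearest existing cluster center.
    eps2: minimum estimation error that justifies a new expert.
    """
    nearest = min(euclidean(x, c) for c in centers)
    error = abs(target - prediction)
    return nearest > eps1 and error >= eps2
```

An observation that is far from every cluster but already well predicted does not trigger growth, and neither does a poorly predicted observation that lies inside an existing cluster; in both cases the winning expert is adapted instead.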

When a new, $(M+1)$th expert network is added to the WTAE, the parameters associated with this local expert network are assigned as follows:

$$C_{M+1} = x^k, \qquad (5a)$$

$$\sigma_{M+1,1} = \sigma_{M+1,2} = \cdots = \sigma_{M+1,N} = 1, \qquad (5b)$$

$$\mu_{M+1}(x^k) = \prod_{j=1}^{N} \exp\left(-\frac{(x_j^k - C_{M+1,j})^2}{2\sigma_{M+1,j}^2}\right), \qquad (5c)$$

while the new sample set $S_{M+1}$ is

$$S_{M+1} = \{x^k\}. \qquad (6)$$

Also, the new expert network is built up and trained by the sample set $S_{M+1}$.

When the observation $(x^k, t^k)$ does not satisfy the two novelty criteria, the Resilient Back-propagation training algorithm [16] is used to adapt the parameters of the winning MLP expert network, while the losing experts remain unchanged. The fuzzy center is updated by calculating the mean of the winning sample set $S^*$:

$$C^* = \frac{1}{R} \sum_{r=1}^{R} x^r, \qquad (7)$$

where $R$ is the number of samples in $S^* = \{x^1, \ldots, x^R\}$. The variance of the fuzzy cluster is determined by both the previous variance and the current statistical variance of the updated sample set $S^* = \{S^*, x^k\}$. We use a momentum to combine these two factors:

$$\sigma^{*2}(t) = \gamma \sigma^{*2}(t-1) + (1 - \gamma) \frac{1}{R} \sum_{r=1}^{R} (x^r - C^*)^2. \qquad (8)$$

When the first several samples are added to a sample set, the statistical variance will be close to zero. The momentum helps to avoid the risk of zero variance while preserving the statistical characteristics of the variance.
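The center and variance update for the winning cluster can be sketched as follows. This is a minimal illustration that treats each input dimension separately and uses a hypothetical momentum value of 0.9, which is not taken from the source.

```python
def update_cluster(samples, prev_var, gamma=0.9):
    """Update a winning cluster from its sample set.

    samples:  list of input vectors stored for the cluster (already
              including the newest observation).
    prev_var: per-dimension variance from the previous step.
    gamma:    momentum weight on the previous variance (assumed value).
    Returns (center, variance), both per-dimension lists.
    """
    r = len(samples)
    n = len(samples[0])
    # The center is the mean of the stored samples.
    center = [sum(s[j] for s in samples) / r for j in range(n)]
    # Statistical variance of the stored samples about the new center.
    stat_var = [sum((s[j] - center[j]) ** 2 for s in samples) / r
                for j in range(n)]
    # Momentum blend keeps the variance away from zero for small sets.
    variance = [gamma * pv + (1.0 - gamma) * sv
                for pv, sv in zip(prev_var, stat_var)]
    return center, variance
```

With a single stored sample the statistical variance is exactly zero, so the returned variance is `gamma` times the previous one; starting from the unit initial variance, this keeps the Gaussian membership well defined.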

The overall output is calculated based on a winner-take-all rule. First, the whole output $\mu$ from the fuzzy clusters goes through the competition layer, where each neuron excites itself and inhibits all the other neurons. The transfer function of the competition layer is defined as

$$a = \mathrm{compet}(\mu). \qquad (9)$$

It works by finding the index $i^*$ of the neuron with the largest net input and setting its output to 1 (with ties going to the neuron with the lowest index). All other outputs are set to 0:

$$a_i = \begin{cases} 1, & i = i^* \\ 0, & \forall i \neq i^*. \end{cases} \qquad (10)$$

Next, the overall output from the estimator, $y$, is calculated as follows:

$$y = \mathbf{y} \cdot a^T. \qquad (11)$$

After the winner is selected by the fuzzy membership function, the only MLP that has to be calculated is the winning expert network.

The last step is executed after all the training samples have been presented. The number of samples in each sample set could be dramatically different, so the purpose of the last step is to prune any expert that contains very few training samples and to reclassify those samples to the nearest fuzzy cluster based upon the Euclidean distance.

2.3. Sensor failure mode detection algorithm

The first part of the WTAE network provides us with the estimated critical sensor value when other sensors' measurements are used as additional inputs. To validate a sensor signal, the validation algorithm should trigger an alarm when the sensor signal deviates significantly from the correct value.

In order to build up the validation gates, the characteristics of both the sensor and the monitored operating environment have to be understood. The validation scheme proposed in this paper performs both time domain and frequency domain sensor failure detection by using three fuzzy sensor validation gates. So, in the learning stage, the statistical information in both the time domain and the frequency domain has to be learned and recorded in three memories. The first one is used to store the minimum, maximum, and mean value of the sensor signal in the training data set. The other two memories are used to record the variation of the autocorrelation coefficients $r$ of the residual of the sensor signal and of the power spectrum $p$ of the time series. A time window is specified in order to calculate $r$ and $p$, which are defined as follows:

Definition 1. For a series of data $\{x_k : k = 1, \ldots, K\}$, the $n$th auto-covariance coefficient is defined as:

$$c_n = \sum_{k=n+1}^{K} (x_k - \bar{x})(x_{k-n} - \bar{x}) / K, \qquad (12)$$

where $\bar{x}$ is the sample mean:

$$\bar{x} = \frac{1}{K} \sum_{k=1}^{K} x_k. \qquad (13)$$

Then the $n$th auto-correlation coefficient is

$$r_n = \frac{c_n}{c_0}. \qquad (14)$$

Definition 2. A series of data $\{x_k : k = 1, \ldots, K\}$ can be represented by its Fourier series:

$$x_k = \sum_{n=0}^{K-1} C_n e^{j 2\pi n k / K}, \qquad (15)$$

where $C_n$ are the Fourier coefficients. Then, the Power Density Spectrum Coefficient can be defined as:

$$P_n = \|C_n\|^2. \qquad (16)$$
After each $r$ and $p$ corresponding to each time window in the learning data set is available, the time series, the autocorrelation coefficients, and the power spectrum density of the sensor signal are learned and analyzed with the eight commonly found sensor failure modes [16], as shown in Fig. 3. The x-axis and y-axis refer to time and vibration acceleration in the first column (Sensor Signal), time and autocorrelation coefficients [defined in Eq. (14)] in the second column (Residual Autocorrelation), and frequency and power density spectrum coefficients [defined in Eq. (16)] in the third column (Power Spectrum). It is worth noting that by inspecting the signatures from different feature spaces, it is easier to distinguish a good sensor from a faulted one.

In Fig. 3, the first column shows the signals of the eight failure modes compared to the nominal state signal in the time domain. The other two columns are used to compare the autocorrelation coefficients of the residual and the power spectrum density for the frequency range of interest (for example, 3-10 kHz in the Westland data set). It is also clear from the figure that using signatures from only one feature space is not sufficient to distinguish the normal signal from the signals of all the failure modes. However, it is much more robust to detect the faults by combining the comparison results from all three feature spaces. In order to measure the distances between objects or points in the feature space, a distance measure is used. There are a number of distance metrics that can be used to measure the similarity between vectors, for example, the Euclidean distance, the Mahalanobis distance, and the Minkowski distance. Without loss of generality, to calculate the difference between the signatures, the Manhattan distance [20] is chosen, which is the distance between two vectors measured along orthogonal axes:

$$d = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n|, \qquad (17)$$

where $x = [x_1, x_2, \ldots, x_n]$ and $y = [y_1, y_2, \ldots, y_n]$ are the two vectors to be measured.

To investigate the variation of the autocorrelation coefficients $r$ for each time window, the Manhattan distances between each vector are calculated with results $d_j$, where $j = 1, \ldots, m$ and $m = n$, the number of time windows in the training data set. The minimum, maximum, and mean value of $d_j$ are recorded as the variation characteristics of the autocorrelation coefficients $r$. The mean vector $\bar{r}$ is also recorded for comparison purposes in the detecting stage. A similar procedure is applied to the power spectrum density $p$.
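The learning-stage bookkeeping for one feature (autocorrelation or power spectrum) can be sketched as below. The text does not fully specify which pairs of vectors are compared, so this sketch assumes that each time window's feature vector is compared with the mean vector over all windows; that choice, and the function names, are assumptions made for illustration.

```python
def manhattan(u, v):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def variation_stats(vectors):
    """Summarize how far each window's feature vector lies from the mean vector.

    vectors: one feature vector per time window in the training set.
    Returns (mean_vector, min_distance, max_distance, mean_distance).
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean_vec = [sum(v[j] for v in vectors) / n for j in range(dim)]
    distances = [manhattan(v, mean_vec) for v in vectors]
    return mean_vec, min(distances), max(distances), sum(distances) / n
```

The three distance statistics are what a validation gate needs in order to place its peak and its edges, and the mean vector is kept as the reference against which windows are compared in the detecting stage.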

Three  Validation  Gates,  sensor  value,  residual 
autocorrelation,  and  power  spectrum,  were  estab¬ 
lished  to  provide  redundancy  using  the  informa¬ 
tion  from  the  learning  stage.  The  basic  structure  of 
each  gate  (or  membership  function)  is  based  on 
the  dynamic  fuzzy  validation  curve.  The  mean 
values  from  the  three  memories  are  set  as  the 
highest  confidence  value  for  the  bell  shape  valida¬ 
tion  function.  The  minimum  and  the  maximum 
values  are  set  at  10%  confidence  value  points  as 
shown  in  Fig.  4. 

In the detecting stage, the autocorrelation of the residual of the sensor output was calculated in every time window and compared, in terms of the Manhattan distance, with the mean autocorrelation coefficients $\bar{r}$ by the corresponding validation gate. The output is the sensor health confidence level in the autocorrelation feature domain. A similar approach is applied to the power spectrum density and to each sensor output datum. Finally, the three outputs from the three gates are fused by fuzzy intersection with the fuzzy min operator. The decision level fusion output is the final confidence level of the sensor health.

Fig. 4. A bell-shaped fuzzy validation gate.
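The gate construction and the min-fusion step described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the analytic form of the bell curve, so each side is modelled here as a Gaussian whose width is chosen so that the recorded minimum and maximum fall at 10% confidence. The function names `manhattan`, `bell_gate`, and `fuse` are hypothetical.

```python
import math


def manhattan(x, y):
    """Manhattan distance between two equal-length vectors, as in Eq. (17)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def bell_gate(value, lo, mean, hi, edge_conf=0.1):
    """Confidence in (0, 1]: 1 at `mean`, `edge_conf` at `lo` and at `hi`.

    Assumption: each side of the bell is a Gaussian with its own width,
    and lo < mean < hi (values recorded in the learning stage).
    """
    if value == mean:
        return 1.0
    half_width = (mean - lo) if value < mean else (hi - mean)
    # Width such that exp(-half_width**2 / (2 * sigma**2)) == edge_conf.
    sigma = half_width / math.sqrt(2.0 * math.log(1.0 / edge_conf))
    return math.exp(-((value - mean) ** 2) / (2.0 * sigma ** 2))


def fuse(confidences):
    """Fuzzy intersection of the gate outputs using the min operator."""
    return min(confidences)
```

In use, the distance between the current window's feature vector and the stored mean vector would be passed to `bell_gate` for each feature domain, and the three resulting confidences passed to `fuse`.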


3.  Simulation  results 

Simulation was used to demonstrate the expected performance of the WTAE networks. Two data sets were used for training and testing the sensor validation system in our studies. The first benchmark data set was generated from a Spectra Quest (SQ) Machinery Fault Simulator. The second data set was a time-series vibration data set, commonly known as the Westland vibration data.

3.1. SQ Machinery Fault Simulator data set

This data set consists of vibration data recorded from the SQ Machinery Fault Simulator. The instrument is constructed with special kinds of bearings, rotors with split collar ends, and a split bracket bearing housing. The simulator supports a wide range of predictive maintenance studies and reproduces the signatures of various bearing faults.

Here, we use three accelerometer sensors in close proximity, which have similar high-frequency waveforms, as the inputs $x$ to the WTAE network to approximate the output $y$ of the fourth sensor, the measurement of interest. The system parameters $\mu_i$, $\sigma_i$, and $C_i$ refer to the statistical distribution of the $i$th fuzzy expert network. The relationship among the sensors is clearly nonlinear, and the modeling work is very complex. The sample data set was separated into two parts. The first 3000 observation pairs are used as the training set, and the remaining 3000 points serve as the testing set. The WTAE network was trained and compared with MLP and RBF based estimators.

The performance function for the experts is the mean square error (MSE) between the network outputs and the target output. An MLP with 10 hidden nodes was chosen ad hoc as the local expert network, trained with the Resilient Back-propagation algorithm. After the WTAE network had been trained for 41 min, the MSE of the training set was 0.0234. The resulting network incorporated 23 local experts, with a pruning criterion of 10 samples per sample set. The MSE of the 3000 testing samples using the WTAE network was 0.1631. The thresholds $E_1$ and $E_2$ were chosen to be 0.1 and 0.001, respectively.

The same training set was also used to train an MLP network. The total number of nodes in the MLP was estimated by the Vapnik-Chervonenkis (VC) dimension theory [47,18] with a conservative upper bound of 300. After training several pre-selected structures, the network with the best result was an MLP with a 150-node hidden layer. This network was also trained by the Resilient Back-propagation algorithm. After 41 min, the training MSE was 0.0554. The network was then tested on the same 3000 samples as the WTAE network. The resulting MSE was 5.9529.

The third estimation method used for comparison is the RBF network. Here, we used the k-Means clustering algorithm to train the first layer and the Moore-Penrose algorithm for the second layer [10]. Because of the high computational cost involved in matrix inversion, the RBF network could be trained on only 775 samples in 41 min. The resulting MSE for the training set is 0.0550. The number of first-layer clusters is 339 because of the complexity of the input space. The next 775 samples were tested with an MSE of 0.0787.
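The two-stage RBF training described above (clustering for the first layer, a pseudoinverse solve for the second) can be sketched as below. This is a minimal illustration, not the configuration used in the experiments: the single shared Gaussian `width`, the plain Lloyd-style `kmeans`, and all function names are assumptions introduced here.

```python
import numpy as np


def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd iterations; returns k cluster centers (first RBF layer)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # copy of k samples
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return centers


def hidden(X, centers, width):
    """Gaussian radial basis activations, one column per center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))


def fit_rbf(X, y, k, width):
    """First layer by k-means, second layer by Moore-Penrose pseudoinverse."""
    centers = kmeans(X, k)
    weights = np.linalg.pinv(hidden(X, centers, width)) @ y
    return centers, weights


def predict_rbf(X, centers, weights, width):
    return hidden(X, centers, width) @ weights
```

The pseudoinverse step works on a matrix with one row per training sample and one column per cluster, which is consistent with the remark above that matrix computations limited how many samples could be used in the allotted time.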

From the above simulation results, we can draw the following conclusions. Just like MLP and RBF based neural networks, the WTAE can uniformly approximate any continuous function without knowledge of the system model, as long as the number of expert units is sufficient. This is because the basic building block of the WTAE is a small single-hidden-layer MLP network, which realizes the nonlinearity of the system locally. This is a great advantage of a neural network based algorithm over the traditional knowledge based tracking system. WTAE networks have outperformed other neural network based systems in complex problems. Based on the 'divide and conquer' strategy, the model employs a growing fuzzy clustering algorithm that naturally divides the input space into overlapping regions on which 'experts' act. In this way, a complicated problem is divided into a series of simpler sub-problems, and a function approximator is assigned to each sub-problem locally.

The growing fuzzy clustering algorithm is able to make soft decision boundaries for overlapping classes. This overlapping is another feature of the input sample sets in sensor tracking problems. The algorithm we used overcomes the overlapping problem by using Gaussian fuzzy membership functions in each dimension of the input space. A performance comparison of the three algorithms is listed in Table 1.

A time window was defined with a length of 300 points. Both the real sensor time series and the tracking time series were transformed to the frequency domain by an FFT with a length of 256. Both the power spectrum density and the autocorrelation of the residual were calculated. The necessary information for establishing the three fuzzy sensor validation gates was recorded.
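A sketch of the per-window feature computation is given below. The window length (300) and FFT length (256) come from the text; everything else is an assumption made for illustration: the power spectrum is taken from the measured sensor series as a simple periodogram, the residual is the measured series minus the estimate, and the autocorrelation is normalized by its zero-lag value. The name `window_features` is hypothetical.

```python
import numpy as np

WINDOW = 300  # points per time window (from the text)
NFFT = 256    # FFT length (from the text)


def window_features(sensor, estimate):
    """Return (psd, acf) for one window of measured and estimated data."""
    sensor = np.asarray(sensor, dtype=float)[:WINDOW]
    estimate = np.asarray(estimate, dtype=float)[:WINDOW]
    residual = sensor - estimate

    # Periodogram-style power spectrum; rfft crops the input to NFFT points.
    spectrum = np.fft.rfft(sensor, n=NFFT)
    psd = (np.abs(spectrum) ** 2) / NFFT

    # Autocorrelation of the mean-removed residual at non-negative lags.
    centred = residual - residual.mean()
    full = np.correlate(centred, centred, mode="full")
    acf = full[full.size // 2:]
    if acf[0] > 0:
        acf = acf / acf[0]  # normalize so that lag 0 equals 1
    return psd, acf
```

The two returned vectors are the quantities whose Manhattan distances to the stored mean vectors would be fed to the residual autocorrelation and power spectrum gates.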

Compared with the traditional time domain indicators, the WTAE network is more reliable and robust in a noisy environment, as shown in Table 2. When the amplitude of the noise is not significant enough to trigger the noise indicator, the traditional indicators could not detect several failure modes, especially 'Cyclic', which means that in

Table 1
A performance comparison of the estimators (41 min of training)

| | WTAE networks | MLP network | RBF network |
|---|---|---|---|
| Architecture | 23 MLP experts with 10 hidden nodes each | 150 hidden nodes | 339 clusters in the first layer |
| Training algorithm | Growing fuzzy, Resilient-BP | Resilient-BP | k-Means clustering, Moore-Penrose |
| Training samples | 3000 | 3000 | 775 |
| MSE of training | 0.0234 | 0.0554 | 0.0550 |
| Testing samples | 3000 | 3000 | 775 |
| MSE of testing | 0.1631 | 5.9529 | 0.0787 |

Table 2
Comparison results of the three validation gates and the fuzzy intersection output

| Sensor state | Sensor value gate: confidence level | Residual autocorrelation gate: Manhattan distance | Residual autocorrelation gate: confidence level | Power spectrum gate: Manhattan distance | Power spectrum gate: confidence level | Fuzzy intersection |
|---|---|---|---|---|---|---|
| Normal | 0.9046 | 1.6723e+003 | 0.9832 | 0.0771 | 0.9447 | 0.9046 |
| Hard-over | 2.8921e-014 | 6.1907e+005 | 1.0613e-008 | 2.7423 | 1.0854e-024 | 1.0854e-024 |
| Bias | 7.1865e-016 | 7.2108e+005 | 1.0613e-008 | 3.1193 | 1.0854e-024 | 1.0854e-024 |
| Spike | 7.1858e-016 | 1.6460e+003 | 0.9989 | 0.0822 | 0.9721 | 7.1858e-016 |
| Stuck | 0.8460 | 5.6381e+003 | 1.0591e-008 | 1.5572 | 1.0854e-024 | 1.0854e-024 |
| Erratic | 0.9268 | 1.5550e+004 | 1.0613e-008 | 0.4496 | 1.0854e-024 | 1.0854e-024 |
| Cyclic | 0.4501 | 4.4328e+004 | 1.0613e-008 | 0.4698 | 1.0854e-024 | 1.0854e-024 |
| Drift | 1.2242e-006 | 6.1678e+004 | 1.0613e-008 | 0.8432 | 1.0854e-024 | 1.0854e-024 |
| Nonlinear | 5.2260e-014 | 3.1871e+004 | 1.0613e-008 | 1.1034 | 1.0854e-024 | 1.0854e-024 |
| Correction rate | 66.67% | | 88.89% | | 88.89% | 100% |

Table 3
A performance comparison of the estimators (63 min of training)

| | WTAE networks | MLP network | RBF network |
|---|---|---|---|
| Architecture | 35 MLP experts with 15 hidden nodes each | 150 hidden nodes | 385 clusters in the first layer |
| Training algorithm | Growing fuzzy, Resilient-BP | Resilient-BP | k-Means clustering, Moore-Penrose |
| Training samples | 3000 | 3000 | 813 |
| MSE of training | 0.3985 | 0.7756 | 42.5571 |
| Testing samples | 3000 | 3000 | 813 |
| MSE of testing | 2.1402 | 45.9567 | 42.3122 |

the same condition, the traditional indicators could detect less than 87.5% of the available failure states. In addition, the traditional indicators need sufficient knowledge of the characteristics of both the sensor and the environment, which is unnecessary for the proposed WTAE network.

3.2.  The  Westland  vibration  data  set 

The Westland data set [19] was acquired using an array of eight accelerometers fixed in specific locations on a set of faulted and unfaulted aft main power transmissions of a US Navy CH-46 helicopter. These accelerometer-equipped transmissions were mounted on a laboratory-based "test rig", and the data were sampled at 103,116.08 Hz. The sensor validation experiment data set was sampled in a no-fault condition at one of the several torque load levels (i.e. 100%).

Because of the complexity of the data set, we included 3000 samples in each of the training and testing sets. The estimation results are listed in Table 3.

Although we added more hidden neurons to the experts than in the first application, the MSEs of both the training and testing sets are still significant. The other two networks show results similar to those of the WTAE network. A possible reason is that the sensors used are not strongly correlated.


4.  Conclusions 

An architecture for estimating sensor measurements and detecting sensor failure is developed in this paper. The method allows us to estimate a critical sensor value when the measurements of other neighboring sensors are used as inputs. Three fuzzy sensor validation gates based on information from both the time domain and the frequency domain were used to detect sensor failure. The network is a synergetic combination of fuzzy logic and neural networks. It employs both the fast parallel computation and learning capability of neural networks and the ability of fuzzy logic to represent and manipulate imprecise information.

The WTAE network consists of two main layers: the fuzzy membership clustering layer and the MLP experts layer. The cluster layer employs the Gaussian radial basis function as a fuzzy membership function. The general idea is to divide a complicated problem into a series of simpler sub-problems and assign a function approximator to each sub-problem. A growing fuzzy membership clustering method is used to divide the input space into overlapping regions on which 'experts' act. After the WTAE network was fully trained in the sensor nominal state, the estimation result was compared with the real sensor outputs by three fuzzy validation gates, which were built from the information collected in the learning stage. The autocorrelation of the residual and the power spectrum of the time series are analyzed using the Manhattan distance. The results from the three validation gates were combined by fuzzy intersection logic. This decision level fusion completes the sensor failure detection procedure, providing the final confidence level of the health of the sensor of interest.

Two benchmark data sets, the SQ Machinery Fault Simulator data set and the Westland vibration data set, were used in simulation studies to demonstrate the performance of the WTAE network. Comparisons between the WTAE networks and the other two neural network estimators were made. The results show that, in terms of estimation performance, the WTAE is competitive with or even superior to the MLP and RBF networks. Furthermore, the fuzzy sensor validation gates algorithm was used to investigate the eight sensor failure modes. The results from the simulation studies have shown that the proposed sensor validation algorithm is capable of detecting all eight faults as long as the basic assumption that the neighboring sensors have an analytical relationship is valid.


Abstract

An innovative neurofuzzy network is proposed herein for pattern classification applications, specifically for vibration monitoring. A fuzzy set interpretation is incorporated into the network design to handle imprecise information. A neural network architecture is used to automatically deduce fuzzy if-then rules based on a hybrid supervised learning scheme. The proposed neurofuzzy classifier is equipped with a one-pass, on-line, and incremental learning algorithm. This network can be considered a self-organized classifier with the ability to adaptively learn new information without forgetting old knowledge. The classification performance of the proposed neurofuzzy network is validated on Fisher's Iris data, a well-known benchmark data set. In terms of generalization capability, the neurofuzzy network achieves 97.33% correct classification. In addition, to demonstrate the efficiency and effectiveness of the proposed neurofuzzy paradigm, numerical simulations have been performed using the Westland data set. The Westland data set consists of vibration data collected from a US Navy CH-46E helicopter test stand. Using a simple fast Fourier transform technique for feature extraction, the proposed neurofuzzy network has shown promising results. Using various torque levels for training and testing, the network achieved 100% correct classification. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords:  Pattern  classification;  Neural  networks;  Fuzzy  logic;  Neurofuzzy  classifier;  Supervised-clustering  learning;  Membership 
functions;  Rule  bases;  Hybrid  architecture;  Radial  basis  function 


1.  Introduction 

In machine health monitoring systems, pattern classification is a key component in identifying failure modes created by the monitored systems. Service and maintenance can be promptly and correctly performed if the pattern classifier makes an accurate recommendation. While operating, mechanical components exhibit physical behaviors, such as temperature, pressure, electromagnetic variation, eddy current, acoustic emission, and vibration, which contain information about the state of the machine. These physical behaviors are sensed by transducer systems to obtain the data used for detecting and diagnosing some of the incipient failures of the machinery and equipment. The input data is entered into a classifier, which is a component of a health monitoring system. Using pattern classification techniques, signatures containing information about machine defects and their causes can be extracted from the data. With an accurate decision from the classifier in a monitoring system, machine maintenance can be performed before catastrophic failures occur.

When a machine is operating properly, the physical behaviors, for example vibrations, are generally small and constant. However, when faults develop that lead to variations of the process dynamics, the physical signatures (i.e. power spectrum density, natural frequency, and mode shape) also change. To detect these changes, classical off-line iterative learning classifiers were proposed to supervise a monitored system. These classifiers have a drawback in that they generally require a long training time. In addition, they often become stuck at local minima and are unable to achieve the optimum solution.

Furthermore, in an operating mode, it is possible that novel faults evolve while a monitored system is running. These faults differ from those the classifier has been trained on and need to be promptly detected and distinguished from them. Conventional neural classifiers need to be retrained with both old and new data in order to learn new information while remembering existing information [1]. It is possible to learn and detect new fault patterns on-line in real time if an incremental learning classifier is used. The classifier developed in this study can learn new fault types incrementally without relearning the existing patterns. Thus, the proposed classifier is easily applied in a monitoring system to detect unseen faults while the system is in an operating mode.


2.  Literature  reviews 

Pattern classification forms a fundamental solution to problems in real world applications. The function of pattern classification is to categorize an unknown pattern into a distinct class based upon a suitable similarity measure. Thus, similar patterns are placed in the same class while dissimilar patterns are classified into different classes.

Engineers and scientists have developed various methodologies to deal with pattern classification problems. Statistical pattern classification is a traditional technique for classification problems [2]. This classical classification technique makes use of statistical decision theory to classify patterns. Various researchers have scrutinized parametric Bayesian classifiers [3], assuming that the forms of the input distributions are known. The parameters of the distributions are computed using all training data. The training data is usually assumed to be Gaussian when using Bayesian classifiers. Because of their simplicity, they are still widely used [4,5].

Automatic pattern classification has been actively pursued by scientists and engineers from different fields. Many researchers in the area of pattern classification have paid attention to neural network classifiers because they are model-free, trainable systems that offer parallel computation and noise tolerance. These properties of artificial neural networks inspire researchers to study neural network applications for pattern classification problems. With their abilities of real-time learning, parallel computation, and self-organization, neural networks are well suited to handling complex classification problems through their learning and generalization abilities [6,7].

In addition, fuzzy set theory [8] has been extensively applied to pattern classification. Fuzzy set theory supports pattern classification by dealing with inexact rather than exact notions. Fuzzy systems perform well on uncertain information, much as human reasoning does. In the real world, most situations are fuzzy rather than crisp. Moreover, the information in pattern classification problems is imprecise rather than precise in nature, and fuzzy set theory allows us to properly model this vague information [9,10].

The integration of neural networks and fuzzy sets is also an active area for pattern classification problems. A growing number of researchers have designed and examined various forms of fuzzy neural or neurofuzzy networks. The idea is to integrate the model-free, trainable nature, parallel computation, and noise tolerance of neural networks with the ability of fuzzy set theory to deal with imprecise situations. The combination of neural networks and fuzzy sets forms a synergetic network that handles pattern classification problems very effectively and efficiently. Some learning algorithms for fuzzy neural or neurofuzzy networks can be found in [1,11-14]. Fuzzy neural or neurofuzzy networks have been widely used in many pattern classification applications, as shown in [11-17].

2.1.  Existing  problems 

Pattern classification techniques have become important in handling many real-world applications. As a complement to statistical classifiers, neural network classifiers, fuzzy classifiers, and neural-fuzzy classifiers have been applied to classification problems. However, those neural network, fuzzy, and neural-fuzzy classifiers have deficiencies in several respects.

For example, the multilayer perceptron (MLP) and the learning vector quantization (LVQ) classifiers require off-line training and iterative presentation of the training data. Thus, they need extensive training time to learn input patterns. Furthermore, these networks need their architecture parameters to be determined in advance. Repeated design work is needed to obtain suitable parameters that achieve reasonable performance in classifying patterns. Moreover, they sometimes fail in the learning process by being unable to converge to the optimal solution because the initial random condition is not properly chosen. Hence, MLP and LVQ classifiers are difficult to apply to pattern classification problems that need fast, on-line, real-time, incremental learning.

Unlike neural net classifiers, fuzzy classifiers can handle uncertain information by providing a soft decision that allows a pattern to belong to more than one class with different membership degrees. However, constructing "if-then" rules for fuzzy classifiers requires knowledge from experts and a time-consuming design process, especially when the dimension of the feature space is large. Thus, traditional fuzzy classifiers are not suitable for on-line, real-time, incremental learning pattern classification.

The aim of this study is to provide a way of determining fuzzy if-then rules during the learning phase. The proposed method can automatically construct the if-then rules of a fuzzy inference system using the learning capability of a neural network. Input and target pairs are used as a "teacher" for the network under supervised learning. This produces an incremental, on-line, one-pass learning algorithm. This model uses Gaussian membership functions in both the antecedent and consequent parts. The proposed method employs supervised clustering-based partitioning.

3. Architecture of the proposed neurofuzzy network

The proposed neurofuzzy network is developed based on a standard fuzzy logic system [18]. A standard fuzzy logic system has four components: a fuzzifier, a fuzzy rule base, an inference engine, and a defuzzifier. The function of the fuzzifier is to determine the degree of membership of a crisp input in a fuzzy set. The fuzzy rule base is used to represent the fuzzy relationships between input and output fuzzy variables. The output of the fuzzy rule base is determined based on the degree of membership specified by the fuzzifier. The inference engine controls the rule base. Optionally, the defuzzifier is used to convert the outputs of the fuzzy rule base into crisp values. The network architecture of the proposed neurofuzzy classifier is shown in Fig. 1.

The proposed neurofuzzy system has three layers: one input layer, one hidden or rule layer, and one output layer. In the input layer, each neuron connects to one element of an M-dimensional input vector. The hidden layer or rule layer is separated into two parts: an antecedent part and a consequent part. The antecedent and consequent together construct a fuzzy rule for a fuzzy inference system. For each rule, the antecedent part consists of M membership nodes and a fuzzy "AND" activation function node. Gaussian membership functions are employed at the membership nodes. Each membership node of the antecedent part connects, in order, to the corresponding node of the M-dimensional input layer via a dynamic synaptic weight matrix, $W_P$, whose rows represent prototype vectors that are the centroids of Gaussian radial basis functions. When more rules are added to the hidden layer, $W_P$ gains more rows. $W_P$ is a long-term-memory trainable weight. In the antecedent parts, a fuzzification procedure and a fuzzy "AND" operation are performed.

Fig. 1. The proposed neurofuzzy network architecture.


The consequent part consists of target membership functions and a fuzzy "AND" activation function. Consequent membership functions are Gaussian membership functions (centered at the elements of $W_T$), as in the antecedent part. The consequent of a fuzzy rule assigns an entire fuzzy set to the output. This fuzzy set is represented by a membership function that is selected to show the qualities of the consequent. If the antecedent is only partially true (i.e. takes a value less than 1), then the output fuzzy set is truncated, which is often referred to as the implication method [18]. After the implication procedure, the fuzzy numerical output of the rule layer is transmitted to the output layer.

The output layer is composed of an aggregation procedure and a defuzzification procedure. The aggregation step combines the inference results of the fuzzy rules from the hidden layer by superimposing all fuzzy conclusions about a variable. The aggregation procedure employs a fuzzy "OR" (maximum) method. For the defuzzification step, the mean of maximum (MOM) method is used to obtain the final crisp output. The MOM defuzzification calculates the average of all values of the output variable that have the maximum membership degree.
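The implication, aggregation, and MOM steps can be sketched on a discretized output axis as follows. This is a minimal illustration under assumptions not stated in the text: the consequent Gaussians share one width `sigma`, the output variable is sampled on a uniform `grid`, and the function names are hypothetical.

```python
import numpy as np


def gaussian(y, center, sigma):
    """Gaussian consequent membership function evaluated on the output grid."""
    return np.exp(-((y - center) ** 2) / (2.0 * sigma ** 2))


def infer(firing, centers, sigma, grid):
    """Crisp output from rule firing strengths and consequent centers."""
    # Implication: truncate each consequent set at its rule's firing strength.
    clipped = [np.minimum(f, gaussian(grid, c, sigma))
               for f, c in zip(firing, centers)]
    # Aggregation: fuzzy OR, i.e. the pointwise maximum over all rules.
    aggregated = np.maximum.reduce(clipped)
    # Mean of maximum: average the grid values that reach the peak membership.
    peak = aggregated.max()
    return float(grid[np.isclose(aggregated, peak)].mean())
```

With class labels used as consequent centers, the crisp value returned by `infer` lies at the label of the most strongly fired rule, because truncation creates a flat plateau centered on that label.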


3.1.  Mathematical  model  of  the  neurofuzzy 
classifier 

Let $\mathbb{R}^M$ be a pattern vector space. Let $\mathbf{p} = [p_1\ p_2\ \ldots\ p_M]^T \in \mathbb{R}^M$ be an input pattern. Each element of the input pattern is a measurement or feature, and each corresponds to one dimension (axis) in the space. For M elements of the input pattern we have an M-dimensional space, or M-space. Let vectors $\mathbf{w}_{Pi} = [w_{Pi1}\ w_{Pi2}\ \ldots\ w_{PiM}]^T$, $i = 1, \ldots, L$, be pattern prototypes of the pattern space. $L$ denotes the number of pattern prototypes. Let matrix $W_P = [\mathbf{w}_{P1}\ \mathbf{w}_{P2}\ \ldots\ \mathbf{w}_{PL}]^T$ be a synaptic weight matrix whose row vectors represent prototypes of the pattern space. Pattern prototypes are the centers of the Gaussian membership functions of the antecedent part in the rule layer. A class may have more than one prototype. Each prototype $\mathbf{w}_{Pi}$, $i = 1, \ldots, L$, is the mean vector of the patterns that belong to the $i$th rule. The antecedent part of a rule is constructed from a pattern prototype. Each rule consists of M membership nodes. Let $\mathbb{R}^1$ be a one-dimensional target space. Let $t \in \mathbb{R}^1$ be the corresponding target or class of the input pattern $\mathbf{p}$. Let vector $W_T = [w_{T1}\ w_{T2}\ \ldots\ w_{TL}]^T$ be a target vector, each element of which, $w_{Ti}$, $i = 1, \ldots, L$, represents the target of a prototype stored in $W_P$, in the same order as their neurons. Each element of $W_T$ is the center of a Gaussian membership function of the consequent part in the rule layer. The numbers of rows of $W_P$ and $W_T$ are the same, and they grow dynamically as more rules are added to the hidden layer.

3.2.  Gaussian  membership  function  and  similarity 
function 

Gaussian membership functions are employed in the hidden layer to represent the degree of similarity between the input pattern and the reference prototypes. For an M-dimensional pattern space, M membership functions are used to form a prototype. These Gaussian membership functions form the antecedent part of a fuzzy rule. A Gaussian membership function is computed by the following equation:

$$\mu(p_m, w_{Pim}) = \exp\left[-\left(\frac{w_{Pim} - p_m}{\sqrt{2}\,\sigma_{im}}\right)^2\right], \quad m = 1, \ldots, M;\ i = 1, \ldots, L, \qquad (1)$$

where $\mu(p_m, w_{Pim})$ is the membership degree of $p_m$ with respect to prototype $\mathbf{w}_{Pi}$ in the $m$th dimension; $p_m$ is the $m$th element of an M-dimensional input pattern; $w_{Pim}$ represents the center of the Gaussian membership function of prototype $\mathbf{w}_{Pi}$ in the $m$th dimension; $\sigma_{im}$ is the standard deviation of the Gaussian function in the $m$th dimension of the $i$th prototype, i.e. the standard deviation of the $i$th prototype is $\boldsymbol{\sigma}_i = [\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{iM}]$, $i = 1, \ldots, L$; and $L$ is the number of prototypes in the pattern space.

The degree of similarity between the input pattern $\mathbf{p}$ and the reference prototype $\mathbf{w}_{Pi}$ is called the "similarity function" and is determined by the following equation:

$$\mu_S(\mathbf{p}, \mathbf{w}_{Pi}) = \min_m\big(\mu(p_m, w_{Pim})\big), \quad i = 1, \ldots, L. \qquad (2)$$

The prototype with the highest membership degree is called the "winner prototype", which can be found by the following equation:

$$\mathbf{w}_{PJ} = \arg\max_{\mathbf{w}_{Pi},\ 1 \le i \le L}\big(\mu_S(\mathbf{p}, \mathbf{w}_{Pi}) \times CF_i\big), \qquad (3)$$

where $J \in \{1, \ldots, L\}$ is the winner's index, and $CF_i$ is the confidence factor of the $i$th prototype. $CF_i$, which will be discussed later, is determined by taking into account the frequency of patterns that fall into the region of the $i$th prototype with respect to all available patterns in the training set. The term fuzzy "AND" may be used in place of the operator "min", and the term fuzzy "OR" and the operator "max" may be used interchangeably.

The similarity function, $\mu_S(\mathbf{p}, \mathbf{w}_{Pi})$, is defined
such that for all patterns $\mathbf{p}$ within the region describing
$\mathbf{w}_{Pi}$, $i = 1, \ldots, L$, there exists a function
$\mu_S(\mathbf{p}, \mathbf{w}_{Pi})$ such that
$\mu_S(\mathbf{p}, \mathbf{w}_{Pi}) > \mu_S(\mathbf{p}, \mathbf{w}_{Pj})$
for all $j \ne i$. Fig. 2a illustrates an example of a one-dimensional
pattern space with three pattern prototypes: $\mathbf{w}_{P1}$,
$\mathbf{w}_{P2}$, $\mathbf{w}_{P3}$. Since this is a one-dimensional
pattern space, only one membership function is used for each prototype.
In Fig. 2a, point $\mathbf{p}$ is classified into the class of
$\mathbf{w}_{P2}$ since $\mu_S(\mathbf{p}, \mathbf{w}_{P2})$ is larger
than both $\mu_S(\mathbf{p}, \mathbf{w}_{P1})$ and
$\mu_S(\mathbf{p}, \mathbf{w}_{P3})$.

Fig. 2b shows an example of a two-dimensional pattern space with three
prototypes. For a two-dimensional pattern space, each prototype is
formed using two membership functions. Applying Eqs. (1)-(3), by
inspection, pattern $\mathbf{p}$ in Fig. 2b is determined to be in the
class of prototype $\mathbf{w}_{P1}$ since
$\mu_S(\mathbf{p}, \mathbf{w}_{P1})$ is larger than both
$\mu_S(\mathbf{p}, \mathbf{w}_{P2})$ and
$\mu_S(\mathbf{p}, \mathbf{w}_{P3})$. The above idea applies, in similar
fashion, to an arbitrary M-dimensional pattern space with L pattern
prototypes.
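
As an illustration of Eqs. (1)-(3), the following minimal Python sketch evaluates the Gaussian membership degrees, the min-based similarity, and the winner index for a small two-dimensional example. The function names, the dictionary layout of a prototype, and the numerical values are assumptions made for illustration only; they are not taken from the original MATLAB programs.

```python
import math

def gaussian_membership(p_m, center, sigma):
    """Eq. (1): membership degree of one input element in one dimension."""
    return math.exp(-((center - p_m) / sigma) ** 2)

def similarity(pattern, centers, sigmas):
    """Eq. (2): fuzzy AND (min) over the per-dimension membership degrees."""
    return min(gaussian_membership(p, c, s)
               for p, c, s in zip(pattern, centers, sigmas))

def winner(pattern, prototypes, confidences):
    """Eq. (3): index J that maximizes similarity times confidence factor."""
    scores = [similarity(pattern, proto["centers"], proto["sigmas"]) * cf
              for proto, cf in zip(prototypes, confidences)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical prototypes in a two-dimensional pattern space.
prototypes = [
    {"centers": [0.0, 0.0], "sigmas": [0.5, 0.5]},
    {"centers": [1.0, 1.0], "sigmas": [0.5, 0.5]},
    {"centers": [2.0, 0.0], "sigmas": [0.5, 0.5]},
]
confidences = [1.0, 1.0, 1.0]  # equal confidence factors, for simplicity

J = winner([0.9, 1.2], prototypes, confidences)  # zero-based index of the winner
```

With equal confidence factors the winner is simply the prototype whose weakest per-dimension membership is largest, which is the hard decision sketched in Fig. 2.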

3.3. Neurofuzzy classifier learning scheme


In the neurofuzzy network, there are two cases of learning: learning of
new prototypes and learning of existing prototypes. In the first case,
a new prototype is added to the hidden layer, i.e. $W_P$ and $W_T$ grow
new neurons. This case applies when a new prototype is found. $W_P$
grows by adding the current input pattern $\mathbf{p}$ to its
structure. Pattern $\mathbf{p}$ forms a new prototype for the pattern
space, i.e. the elements of $\mathbf{p}$ become the new centers of the
Gaussian membership functions in the antecedent part of the rule layer.
Similarly, $W_T$ grows a new neuron when $W_P$ grows. The target label
of pattern $\mathbf{p}$ is added to $W_T$, where it becomes the center
of a new consequent Gaussian membership function.

Fig. 2. (a) One-dimensional pattern space; (b) two-dimensional pattern space.

The second case, learning of existing prototypes, is applied when a
pattern $\mathbf{p}$ is considered to have the same target as an
existing prototype. This learning process adapts the shape of the
existing membership functions, i.e. the means and standard deviations
of the Gaussian membership functions. The learning of existing
prototypes in the neurofuzzy network takes place in the antecedent part
of the hidden (rule) layer. (The learning of existing prototypes does
not take place in the consequent part, since the means and standard
deviations of the membership functions in the consequent part are
fixed.)


When learning an existing prototype, the neurofuzzy network employs a
supervised-clustering learning paradigm that clusters input data using
the corresponding targets to validate group prototypes. The similarity
between a presented input pattern and the existing cluster prototypes
is measured. If the similarity criterion is met, the pattern is
included in the selected prototype, provided that it has the same
target. Once an input pattern is included in a prototype, only the
parameters of the selected prototype, i.e. count, sum of squares, mean,
and standard deviation, are updated as follows:

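$$C_J^{(n+1)} = C_J^{(n)} + 1, \qquad (4)$$

$$\mathbf{ssq}_J^{(n+1)} = \mathbf{ssq}_J^{(n)} + \mathbf{p}^{2}, \qquad (5)$$

$$\mathbf{w}_{PJ}^{(n+1)} = \frac{\big(C_J^{(n+1)} - 1\big)\,\mathbf{w}_{PJ}^{(n)} + \mathbf{p}}{C_J^{(n+1)}}, \qquad (6)$$

$$\boldsymbol{\sigma}_J^{(n+1)} =
\begin{cases}
\boldsymbol{\sigma}_0 & \text{for } C_J = 1, \\[4pt]
\sqrt{\dfrac{\mathbf{ssq}_J^{(n+1)} - C_J^{(n+1)}\big(\mathbf{w}_{PJ}^{(n+1)}\big)^{2}}{C_J^{(n+1)} - 1}} & \text{otherwise.}
\end{cases} \qquad (7)$$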


Here, $n$ represents a time index. $C_J$ represents the number of inputs
that have been counted into the Jth prototype. Each element of
$\mathbf{w}_{PJ}$ is the center of a Gaussian membership function of
the Jth prototype. $\mathbf{ssq}_J$ is the sum of squares of the
patterns that have been included in the Jth prototype. The standard
deviation, $\boldsymbol{\sigma}_J$, indicates the spread of the data in
the Jth prototype. $\boldsymbol{\sigma}_0$ is the initial standard
deviation, representing the isotropic spread in pattern space of a new
prototype for the first sample.
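
The update in Eqs. (4)-(7) amounts to keeping running statistics per prototype. The sketch below is one possible reading of those equations in Python, applied element-wise to each dimension. The function and field names are hypothetical, and the lower bound `sigma_c` on the standard deviation is an assumption borrowed from the constraint described with the algorithm in Section 3.6.

```python
import math

def new_prototype(dim):
    """Empty prototype record (hypothetical layout used only in this sketch)."""
    return {"count": 0, "ssq": [0.0] * dim,
            "centers": [0.0] * dim, "sigmas": [0.0] * dim}

def update_prototype(proto, pattern, sigma0, sigma_c):
    """Fold one pattern into a prototype using the running statistics of Eqs. (4)-(7)."""
    proto["count"] += 1                                   # Eq. (4)
    n = proto["count"]
    for m, p in enumerate(pattern):
        proto["ssq"][m] += p * p                          # Eq. (5), element-wise square
        proto["centers"][m] = ((n - 1) * proto["centers"][m] + p) / n  # Eq. (6)
        if n == 1:
            sigma = sigma0                                # Eq. (7), first sample
        else:
            var = (proto["ssq"][m] - n * proto["centers"][m] ** 2) / (n - 1)
            sigma = math.sqrt(max(var, 0.0))              # guard against rounding error
        proto["sigmas"][m] = max(sigma, sigma_c)          # assumed floor, see lead-in
    return proto

proto = new_prototype(1)
for x in ([2.0], [4.0], [6.0]):
    update_prototype(proto, x, sigma0=0.5, sigma_c=0.01)
```

For the three one-dimensional samples 2, 4 and 6, the running mean is 4 and the running (sample) standard deviation is 2, which matches a direct batch calculation.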


3.4.  Fuzzy  if-then  rule  bases 


In a standard fuzzy system, for a finite-class pattern classification
problem with an M-dimensional pattern space, linguistic knowledge can
be written as a set of fuzzy if-then rules as follows:

$R_i$: IF $p_1$ is $A_{i1}$ AND ... AND $p_M$ is $A_{iM}$
THEN Class $T_i$ with confidence factor = $CF_i$

where $R_i$, $i = 1, 2, \ldots, L$, is the label of the ith rule;
$A_{im}$, $m = 1, 2, \ldots, M$, is the antecedent fuzzy set of the ith
rule for the mth dimension; $T_i$ is the consequent class; $CF_i$ is
the grade of certainty that the input pattern belongs to class $T_i$;
and L is the number of fuzzy rules. Each antecedent fuzzy set $A_{im}$
has a linguistic label such as “small,” “medium,” or “large.” The grade
of certainty $CF_i$ is usually given as a real number in the unit
interval [0, 1].

From a knowledge representation viewpoint, it is preferable to
represent fuzzy rules in linguistic form as in the standard fuzzy
system. For the sake of fast, one-pass learning behavior, the
neurofuzzy classifier does not employ fuzzy linguistic rules. However,
a mapping algorithm can be used to map the fuzzy if-then rules of the
neurofuzzy classifier into fuzzy linguistic form, which is more
comprehensible for human users.

The fuzzy if-then rules automatically constructed by the neurofuzzy
classifier can be interpreted as follows:

• Rule 1: IF ($p_1$ is $A_{11}$) AND ($p_2$ is $A_{12}$) AND
  ... AND ($p_M$ is $A_{1M}$) THEN Class is $T_1$ with $CF_1$;

• Rule 2: IF ($p_1$ is $A_{21}$) AND ($p_2$ is $A_{22}$) AND
  ... AND ($p_M$ is $A_{2M}$) THEN Class is $T_2$ with $CF_2$;

• Rule L: IF ($p_1$ is $A_{L1}$) AND ($p_2$ is $A_{L2}$) AND
  ... AND ($p_M$ is $A_{LM}$) THEN Class is $T_L$ with $CF_L$;

where $p_m$, $m = 1, \ldots, M$, is the mth element of an input pattern
$\mathbf{p}$; $A_{im}$ is the antecedent membership function at node m
of the ith rule, $i = 1, \ldots, L$ (an antecedent membership function
is a Gaussian radial basis function centered at an element of vector
$\mathbf{w}_{Pi}$); $T_i$, $i = 1, \ldots, L$, represents the
consequent membership function of the ith rule (i.e. $T_i$ is the ith
consequent membership function, which is a Gaussian function centered
at the ith element of $W_T$); and $CF_i$, $i = 1, \ldots, L$, is the
confidence factor of the ith rule. By applying the concept of
probability, $CF_i$ can be determined by the following equation:

$$CF_i = \frac{C_i}{\sum_{j \in \mathrm{Class}(w_{Ti})} C_j}, \quad i = 1, \ldots, L, \qquad (8)$$

where $C_i$ is the count parameter of the ith rule (i.e. ith
prototype), obtained when a pattern is included in the ith prototype;
and $\mathrm{Class}(w_{Ti}) = \{k \mid w_{Tk} = w_{Ti},\; k = 1, \ldots, L\}$.
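
Read this way, Eq. (8) normalizes each rule's count by the total count of all rules that share its target class. A short Python sketch of that reading follows; the function name and the example counts and targets are hypothetical.

```python
def confidence_factors(counts, targets):
    """Eq. (8): CF_i = C_i divided by the summed counts of rules with the same target."""
    class_totals = {}
    for count, target in zip(counts, targets):
        class_totals[target] = class_totals.get(target, 0) + count
    return [count / class_totals[target]
            for count, target in zip(counts, targets)]

# Hypothetical example: rules 1 and 2 describe class 1, rule 3 describes class 2.
counts = [30, 10, 20]
targets = [1, 1, 2]
cf = confidence_factors(counts, targets)
```

In this example the two class-1 rules receive confidence factors of 0.75 and 0.25, while the single class-2 rule receives 1.0, so a rule that absorbed more training patterns of its class carries more weight in Eq. (3).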

3.5.  Defuzzification  method 

A defuzzification process is applied, in the output layer, to obtain
the final crisp output. The defuzzification method used in this study
is the mean of maximum (MOM), defined by the following equations:

$$\mathrm{MOM} = \frac{\sum_{i=1}^{O} y_i}{\#\text{ of elements in } \mathbf{y}}, \qquad (9)$$

$$\mathbf{y} = \{\, y \mid \mu(y) = \max(\mu(Y)) \,\}, \qquad (10)$$

where $\mathbf{y} = [y_1 \ldots y_O]$ is the set of output values with
the highest membership degree; O is the dimension of $\mathbf{y}$; Y is
the output “universe of discourse,” which is the range of values the
outputs can take on; and $\mu(y)$ defines a membership function for y.
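
The following sketch shows how Eqs. (9) and (10) can be evaluated over a discrete universe of discourse. The function name, the sampled universe, and the membership values are hypothetical and serve only to illustrate the averaging of the maximizing outputs.

```python
def mean_of_maximum(universe, membership):
    """Eqs. (9)-(10): average the output values whose membership degree is maximal."""
    degrees = [membership(y) for y in universe]
    peak = max(degrees)
    maximizers = [y for y, d in zip(universe, degrees) if d == peak]  # Eq. (10)
    return sum(maximizers) / len(maximizers)                          # Eq. (9)

# Hypothetical aggregated output membership sampled at four candidate outputs.
universe = [1, 2, 3, 4]
aggregated = {1: 0.2, 2: 0.7, 3: 0.7, 4: 0.1}
crisp = mean_of_maximum(universe, aggregated.get)
```

Here the outputs 2 and 3 share the highest membership degree, so the crisp output is their mean, 2.5. When a single output attains the maximum, MOM returns that output unchanged.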

3.6. Neurofuzzy classifier algorithm

The neurofuzzy network is constructed automatically at run time by a
supervised-clustering algorithm. (Supervised-clustering learning
separates patterns into sub-clusters by using the corresponding targets
of the input patterns to decide whether the data should be included in
the same group.) Initially, the network contains no rules in the hidden
layer. When the training data is presented, the network learns to
construct its weights and connections.

Two parameters are needed for the neurofuzzy classifier: the initial
standard deviation, $\sigma_0$, and the constraint standard deviation,
$\sigma_c$. When a pattern is included in a cluster, the standard
deviation is updated depending on the distribution of patterns in that
cluster. However, the updated standard deviation may become very close
to zero, causing a division-by-zero problem. To avoid this problem, a
standard deviation near zero is set to a constraint value larger than
zero, for example 0.0001. The classification algorithm of the
neurofuzzy classifier is as follows.

•  STEP 1: Initialization

   o There are no neurons in the hidden layer.
   o Set the count for each node to zero.
   o Set the sum of squares to zero.
   o Set the initial standard deviation of the Gaussian membership
     function to a small value, for example 0.1.
   o Set the constraint for the standard deviation to a small value,
     for example 0.0001.
   o Use the first input pattern to generate the antecedent membership
     functions and other variables:

     - Set each element of the first input pattern to be the centers or
       means of the Gaussian functions, i.e. set the mean of each node
       equal to each element of the input pattern.
     - Set the standard deviation of each node equal to the initial
       standard deviation.
     - Set the sum of squares for each node equal to each element in
       the input pattern.

   o Construct the first membership function of the consequent part
     using the target of the first input as the center of the
     membership function.

•  STEP 2: Input presentation

   o Read in the next input pattern (and, optionally, the next
     corresponding target).
   o Fuzzify each input element using the membership functions in the
     membership nodes. Each crisp element of the input pattern is
     fuzzified into membership values by using the parameters of the
     corresponding node, i.e. its mean and standard deviation. Each
     node will have a membership degree for each corresponding input
     element. (This is in the antecedent part.)
   o Apply the fuzzy “AND” (min) operator to the membership values from
     each membership node, which yields a single membership value.
   o Apply the implication method to the membership functions of the
     consequent part using the fuzzy “AND” (min) operator.
   o If the corresponding target of the input is not presented, then
     skip STEP 3.

•  STEP 3: Learning step

   o If the corresponding target of the input is presented (i.e. the
     system is in training mode), determine which rule the input most
     satisfies (i.e. the rule that the input belongs to) by applying
     the fuzzy “OR” (max) operation to the output of the fuzzy “AND”
     (min) operation.
   o If the winner rule meets the criteria of belonging (i.e. its
     prototype has the highest membership degree and the target of the
     input pattern is the same as the predicted class), then include
     that input in the winner rule by recalculating the rule
     parameters, i.e. count, mean, sum of squares, and standard
     deviation.
   o If the winner rule does not satisfy the criteria (i.e. the target
     of the input pattern is not the same as the predicted class), then
     create a new rule:

     - Set each element of the input pattern to be the centers or means
       of the Gaussian functions, i.e. set the mean of each node equal
       to each element of the input pattern;
     - Set the standard deviation of each node equal to the initial
       standard deviation;
     - Set the sum of squares for each node equal to each element in
       the input pattern;
     - Construct the new membership function of the consequent part
       using the corresponding target of the present input pattern as
       the center of the membership function.

•  STEP 4: Final prediction

   o Apply the aggregation method using the fuzzy “OR” operator.
   o To obtain the crisp class output, apply the defuzzification method
     using the MOM method.
   o If there is no further input then STOP, otherwise go to STEP 2.
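
The control flow of STEPs 1-4 can be sketched compactly in Python. This is a simplified illustration under stated assumptions, not the original MATLAB implementation: the class and method names are hypothetical, confidence factors and MOM defuzzification are omitted so that the predicted class is simply the target of the winning rule, and the sum of squares of a new rule is initialized with the squared input elements so that it stays consistent with Eq. (5).

```python
import math

class NeurofuzzySketch:
    """One-pass supervised-clustering classifier, loosely following STEPs 1-4."""

    def __init__(self, sigma0=0.5, sigma_c=0.01):
        self.sigma0, self.sigma_c = sigma0, sigma_c
        self.rules = []  # STEP 1: the hidden layer starts with no rules

    def _similarity(self, rule, pattern):
        # STEP 2: fuzzify each element, then apply fuzzy AND (min).
        return min(math.exp(-((c - p) / s) ** 2)
                   for p, c, s in zip(pattern, rule["centers"], rule["sigmas"]))

    def _winner(self, pattern):
        # Fuzzy OR (max) over the rule activations; None if no rule exists yet.
        return max(self.rules, key=lambda r: self._similarity(r, pattern),
                   default=None)

    def predict(self, pattern):
        # STEP 4, simplified: report the target of the winning rule.
        rule = self._winner(pattern)
        return None if rule is None else rule["target"]

    def _add_rule(self, pattern, target):
        self.rules.append({"count": 1, "target": target,
                           "centers": list(pattern),
                           "ssq": [p * p for p in pattern],  # assumption, see lead-in
                           "sigmas": [self.sigma0] * len(pattern)})

    def _include(self, rule, pattern):
        # Recalculate count, sum of squares, mean, and standard deviation.
        rule["count"] += 1
        n = rule["count"]
        for m, p in enumerate(pattern):
            rule["ssq"][m] += p * p
            rule["centers"][m] += (p - rule["centers"][m]) / n
            var = (rule["ssq"][m] - n * rule["centers"][m] ** 2) / (n - 1)
            rule["sigmas"][m] = max(math.sqrt(max(var, 0.0)), self.sigma_c)

    def learn(self, pattern, target):
        # STEP 3: include the pattern in the winner rule or create a new rule.
        rule = self._winner(pattern)
        if rule is not None and rule["target"] == target:
            self._include(rule, pattern)
        else:
            self._add_rule(pattern, target)

model = NeurofuzzySketch(sigma0=0.5, sigma_c=0.01)
training = [([0.0, 0.1], "A"), ([0.2, 0.0], "A"),
            ([3.0, 3.1], "B"), ([3.2, 2.9], "B")]
for pattern, target in training:  # a single pass over the training data
    model.learn(pattern, target)
```

In this toy run the four patterns produce two rules, one per class. Because the winner is chosen before the target is checked, presenting the same patterns in a different order can produce a different rule set, which is the order sensitivity discussed for the Iris experiments.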


4.  Simulation  results 

To demonstrate the performance of the neurofuzzy classifier, software
simulations were used in our experiments. The simulation programs were
written to run under MATLAB version 5.3 or higher. A Pentium 233MMX PC
hosted the simulation programs. Two data sets were used for training
and testing the classifier in our studies. The first benchmark data set
was the well-known Fisher's Iris data set [19]. The second was a
time-series vibration data set known as the Westland vibration data
set [20]. The details of these experiments are as follows.

4.1. Fisher's Iris flower data set

Fisher's Iris flower data set consists of 150 patterns with four
features: sepal length, sepal width, petal length, and petal width.
These four features describe the shape and size of the Iris flowers.
Each pattern in the data set falls into one of three classes: Setosa,
Versicolour and Virginica, with a total of 50 patterns per class. For
the purpose of this experiment, we will call them Class 1, Class 2, and
Class 3, respectively. Class 1 is linearly separable from the other
two. However, Class 2 and Class 3 are not linearly separable from each
other.

Fig.  3a  shows  the  scatter  plot  of  the  Iris  data  for 
sepal  width  and  length  features.  It  is  worth  noting 
from  the  plot  that  Class  1  can  be  easily  separated 
from  Class  2  and  Class  3.  However,  Class  2  and 
Class  3  seem  very  difficult  to  separate  since  there  is 
an  overlap  between  them.  Moreover,  in  Fig.  3b, 
the  petal  width  and  length  features  are  plotted 
showing  that  Class  1  is  very  well  separated  from 
Class  2  and  Class  3.  However,  Class  2  and  Class  3 
remain  overlapped  [19]. 

4.1.1. Classification results from the neurofuzzy classifier on Iris data

In this study, 10 trials were performed by randomly selecting the order
of the data. For each trial, the training set was composed of the first
75 patterns. The other 75 patterns were used to test the classification
performance of the trained network. The initial standard deviation,
$\sigma_0$, was set to 0.5. The constraint on the updated standard
deviation, $\sigma_c$, was set to 0.01. The results of the study are
shown in Table 1.


P. Meesad, G.G. Yen / ISA Transactions 39 (2000) 293-308

Fig. 3. (a) Scatter plot of sepal width and length features of the Fisher's Iris data; (b) scatter plot of petal width and length features of
the Fisher's Iris data.


Table 1

The classification performance of the neurofuzzy classifier on Iris data

Trial no.                   1      2      3      4      5      6      7      8      9      10
No. of rules                18     9      9      11     19     9      11     10     11     11
% correct                   97.33  88     97.33  96     90.67  97.33  93.33  92     94.67  96
No. of wrong classifications 2     9      2      3      7      2      5      6      4      3


Using 75 training patterns, the neurofuzzy classifier performed one
pass to automatically generate fuzzy if-then rules (or hidden nodes).
From Table 1, the minimum and maximum numbers of rules generated were
nine and 19, respectively. The minimum correct classification achieved
was 88%, with nine patterns wrongly classified. A maximum of 97.33%
correct classification was achieved on this data set. It is worth
noting that the neurofuzzy classifier learns on-line and incrementally
in only a single pass through all training patterns. The variation in
the percentage of correct classification is due to the sensitivity to
the order of presentation of the training input. When presented with
different input orders, the network constructs different rules and
therefore yields different performance. However, at the maximum
percentage of correct classification, the performance of the proposed
neurofuzzy classifier on the Iris data set was as good as that of many
well-known classifiers (see [1] for references).
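
The percentages in Table 1 follow directly from the number of wrongly classified patterns out of the 75 test patterns per trial. The short Python check below reproduces them, reading the error counts of the ten trials as 2, 9, 2, 3, 7, 2, 5, 6, 4 and 3; the mean over the trials is a derived figure that is not reported in the original text.

```python
TEST_SIZE = 75  # test patterns per trial, as stated in the text
wrong = [2, 9, 2, 3, 7, 2, 5, 6, 4, 3]  # wrong classifications per trial (Table 1)

# Percent correct per trial, rounded to two decimals as in Table 1.
percent_correct = [round(100 * (TEST_SIZE - w) / TEST_SIZE, 2) for w in wrong]

# Derived summary (not in the original table): mean accuracy over the 10 trials.
mean_correct = sum(100 * (TEST_SIZE - w) / TEST_SIZE for w in wrong) / len(wrong)
```

The recomputed values agree with the "% correct" row, and the mean over the ten random orderings is about 94.27%, which gives a sense of typical rather than best-case performance.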


4.2. The Westland vibration data set

This data set consists of vibration data recorded using eight
accelerometers mounted at different locations on the aft main power
transmission of a US Navy CH-46E helicopter. The CH-46E Sea Knight is a
twin-rotor, fore/aft transmission rotorcraft powered by two turbine
engines. The data set is archived at the Applied Research Laboratory
(ARL) of Penn State University. The Westland vibration data was
collected using an International Recording Instruments Group analog
tape recorder, with a single mixbox and aft main transmission installed
on a test stand and run at nine different torque levels (i.e. 27, 40,
45, 50, 60, 70, 75, 80 and 100%). While collecting the data, only one
faulted component was installed in the mixbox and transmission at a
time. The data was recorded for seven types of faults and one “no
fault” condition, as listed in Table 2. Employing a 10-channel data
acquisition system, the data was digitized at a sample rate of
103,116.08 Hz with a 16-bit quantization level and was saved in
1.506 MB data files. Altogether, there are 71 files, each containing
all eight accelerometer signals. The data files used in this study were
1-s data files [20].

Table 2

A list of the fault types created in the test gearbox

Fault no.  Description
2          Epicyclic planet gear bore/bearing/inner race corrosion spalling
3          Spiral bevel input pinion bearing journal corrosion pitting/spalling
4          Spiral bevel input pinion gear tooth spalling/scuffing
5          High speed helical input pinion tooth chipping and freewheel clutch bearing false brinelling
6          Helical idler gear crack propagation
7          Collector gear crack propagation
8          Quill shaft crack propagation
9          No defect

4.2.1. Westland data characteristics
Figs. 4a and b show two samples of vibration data in the time domain
pertaining to fault Class 2 and Class 3 from Accelerometer 1 of the
Westland Data Archive. It is difficult to discriminate between the two
raw time series, which provide little information to use for
classification. It is preferable to transform the signal from the time
domain to the frequency domain by using the fast Fourier transform
(FFT) technique. The vibration signatures in the frequency domain are
shown in Figs. 5a and b, which are power spectral density plots of the
two signals given in Figs. 4a and b, respectively. It is easy to see
that frequency content above 20 kHz is less useful. The effective
information for classification is in the frequency range of 3-10 kHz.
For the frequency range of interest, 0-12 kHz, Figs. 6a and b
illustrate a “zoom-in” version of the power spectral density plots
shown in Figs. 5a and b, respectively.

More sample plots in the frequency domain for the 100% torque level of
the Westland vibration data set are shown in Fig. 7, which presents
sample patterns of faults 2, 3, 4, 5, 6, 7, 8, and “no fault” from all
eight accelerometers. It is worth noting that data from a single sensor
alone is not sufficient to classify all fault classes. It is easier to
classify the data by using the patterns obtained from all eight
sensors. In this study, most of our experiments used the combined
signatures from all eight sensors as training patterns.

4.2.2.  The  performance  of  the  neurofuzzy  classifier 
on  the  Westland  data  set 

In our experiments, vibration time-series data was preprocessed using
the FFT technique to transform it from the time domain to the frequency
domain. The power spectrum command (SPECTRUM in the MATLAB Signal
Processing Toolbox) with a Hanning window of 1024 samples was utilized.
The data was filtered to the frequency band of interest, 3-10 kHz,
yielding a 141 × 1 vector for each channel. Vectors from the eight
channels were concatenated into one vector (a 1128 × 1 vector). A list
of torque levels, fault types, and number of patterns used in this
study is shown in Table 3.

Fig. 4. (a) A plot of time series data of fault 2 from sensor 1; (b) a plot of time series data of fault 3 from sensor 1.

Fig. 5. (a) Power spectrum density dB plot of fault 2 from sensor 1; (b) power spectrum density dB plot of fault 3 from sensor 1.

Fig. 6. (a) Power spectrum density plot of fault 2 from sensor 1; (b) power spectrum density plot of fault 3 from sensor 1.
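
The preprocessing described above (windowing, power spectrum, band selection, and concatenation across channels) can be sketched in plain Python as follows. This is a simplified, single-segment periodogram with a Hann window and a direct DFT; it does not reproduce the segment averaging of a Welch-style estimator or the exact output of the SPECTRUM command. The function names and the toy sampling rate, segment length, and band limits are assumptions chosen to keep the example small.

```python
import cmath
import math

def band_power(signal, fs, f_lo, f_hi):
    """Hann-windowed periodogram of one segment, restricted to [f_lo, f_hi] Hz."""
    n = len(signal)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]
    x = [s * w for s, w in zip(signal, window)]
    features = []
    for b in range(n // 2 + 1):
        freq = b * fs / n  # center frequency of DFT bin b
        if f_lo <= freq <= f_hi:
            coeff = sum(x[k] * cmath.exp(-2j * math.pi * b * k / n)
                        for k in range(n))
            features.append(abs(coeff) ** 2 / n)
    return features

def combine_channels(channels, fs, f_lo, f_hi):
    """Concatenate the band-limited spectra of all channels into one pattern."""
    pattern = []
    for signal in channels:
        pattern.extend(band_power(signal, fs, f_lo, f_hi))
    return pattern

fs, n = 1000.0, 64  # toy values, not the Westland acquisition settings
tone = [math.sin(2 * math.pi * 125.0 * k / fs) for k in range(n)]
channels = [tone, [0.5 * s for s in tone]]  # two hypothetical sensor channels
pattern = combine_channels(channels, fs, f_lo=100.0, f_hi=200.0)
```

With these toy settings each channel contributes six frequency bins, so the combined pattern has 12 elements. In the paper the same construction yields 141 values per channel and 1128 values for eight channels.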

In this study, the initial standard deviation, $\sigma_0$, was set to
0.5. The constraint on the updated standard deviation, $\sigma_c$, was
set to 0.01. Table 4 shows the results of the simulation study for the
performance of the neurofuzzy classifier on the Westland vibration
data. All torque levels (i.e. 27, 40, 45, 50, 60, 70, 75, 80, and 100%)
were used for both training and testing of the neurofuzzy classifier.
When training and testing used the same torque level, only 100 patterns
were used for training, and the remaining patterns were used for
testing. When different torque levels were used for training and
testing, all available patterns of the training level were used for
training. For the last column of Table 4, 100 patterns from every
torque level were used for training, and all available patterns were
then used for testing.

The complete simulation results of the neurofuzzy classifier on the
Westland vibration data are shown in Table 4. In Table 4, the columns
represent the “training data” at different torque levels, and the rows
indicate the “testing data” at different torque levels. The percentage
of correct classification for a given pair is read at the intersection
of the corresponding column and row. For instance, 100% correct
classification was achieved when the neurofuzzy network was trained on
the 40% torque level and tested on the 45% torque level. The numbers of
rules resulting from the training process of the neurofuzzy classifier
are shown in Table 4 in the row labeled “Number of rules.” These
numbers indicate how many rules, or hidden nodes, the neurofuzzy
classifier has created. For the last column of Table 4, trained on 100
patterns from every torque level, the neurofuzzy classifier generated
36 rules (i.e. hidden nodes) with 100% correct classification for all
torque levels when tested on all available patterns. It can be seen
that the proposed neurofuzzy classifier generalizes better at low
torque levels.

Fig. 7. Power spectrum density plot of 100% torque load with
different faults from eight sensors.

Table 3

A list of torque levels, fault types, and number of patterns used
in this study

Torque level (%)   Fault types                                 Number of patterns
27                 N/A(a)  3   4   N/A   N/A   7     8   9     400
40                 N/A     3   4   N/A   N/A   7     8   9     700
45                 N/A     3   4   N/A   N/A   N/A   8   9     350
50                 N/A     3   4   N/A   N/A   7     8   9     700
60                 N/A     3   4   N/A   N/A   7     8   9     500
70                 N/A     3   4   5     6     7     8   9     500
75                 N/A     3   4   5     6     7     8   9     400
80                 N/A     3   4   5     6     7     8   9     500
100                2       3   4   5     6     7     8   9     500

(a) N/A, not available.

The proposed neurofuzzy classifier used only one pass to learn all the
training patterns. It is worth noting that the number of hidden nodes,
or fuzzy if-then rules, created automatically equals the number of
fault classes existing for each torque level. This suggests that the
proposed classifier can deduce an optimum rule base from this data set.
The results show that the proposed neurofuzzy classifier performs very
well in classifying fault classes, achieving 100% correct
classification when the training and testing data share the same torque
level.

Two other classifiers were used for comparison with the proposed
neurofuzzy classifier on the Westland vibration data set. The first was
a multilayer perceptron (MLP) neural network trained by backpropagation
with variable learning rates. The MLP network comprised one hidden
layer with 10 hidden nodes and one output layer with four nodes.
Log-sigmoidal functions were utilized in the MLP network. The sum of
squared error (SSE) goal was set to 0.001. The MLP network was trained
for 10 trials. To meet the SSE goal, the MLP network required 414
iterations and 21 min of training time, averaged over the 10 runs. The
second network was a radial basis function (RBF) network constructed
using every training pattern as the center of a Gaussian, i.e. as a
weight in the radial basis layer. The output weights were determined by
solving a linear equation (this was done using the function “newrbe”
from the MATLAB Neural Network Toolbox version 3).

This experiment was performed using 200 patterns of the 100% torque
level to train the selected classifiers. The testing data sets were
composed of the remaining 200 patterns from the 100% torque load, 700
patterns from the 80% torque load, 350 patterns from the 75% torque
load, and 700 patterns from the 70% torque load. The data used were
1128-dimensional vectors that were combined from all eight sensors. To
measure the computational cost, the functions “tic” and “toc” in MATLAB
were utilized to average the learning time over 10 runs of each
network. The proposed neurofuzzy network incrementally learned and
generated eight clusters, i.e. eight rules in the hidden layer. The
training time was about 2 min for a single iteration. On this data set,
the neurofuzzy network required less training time than the MLP and RBF
networks did; constructing the RBF network took 4 min. The results of
the simulations are shown in Table 5.
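
For readers reproducing the timing comparison outside MATLAB, the averaging of wall-clock training time over repeated runs can be sketched in Python as below. Here `time.perf_counter` plays the role of the “tic”/“toc” pair; the helper name and the stand-in training function are hypothetical.

```python
import time

def average_runtime(train, runs=10):
    """Mean wall-clock seconds of `train()` over `runs` repetitions."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()                   # comparable to "tic"
        train()
        elapsed.append(time.perf_counter() - start)   # comparable to "toc"
    return sum(elapsed) / len(elapsed)

def dummy_training():
    # Hypothetical stand-in for training one of the compared networks.
    return sum(i * i for i in range(10_000))

mean_seconds = average_runtime(dummy_training, runs=10)
```

Averaging over several runs reduces the influence of background load on a single measurement, which matters when the compared training times differ by only a few minutes.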

Table 4

Percent correct classification of the neurofuzzy classifier for the Westland vibration data(a)

                                Training data (torque levels)
Testing data (torque levels)    27%    40%    45%    50%    60%    70%    75%    80%    100%   All torque levels
Number of rules                 5      5      4      5      5      7      7      7      8      36
27%                             100    100    99.5   80     60     79.2   60     59.2   28     100
40%                             100    100    100    100    73.2   77     54     40     20     100
45%                             86     100    100    100    81.5   62     30.5   25     25     100
50%                             87.6   73.6   61.75  100    91     60     40.6   40     40     100
60%                             78.6   40     61.75  68.8   100    64     60.2   40     40     100
70%                             53.6   40     50     40     60.4   100    99.86  71     44.57  100
75%                             40     40     50     40     40     93.14  100    71.43  71.43  100
80%                             60     40     50     40     40     45.29  67.71  100    62.14  100
100%                            40.4   20     50     40     40     32.86  42.57  62.57  100    100

(a) 100 patterns were used for training when patterns from the same level were used for testing. All patterns from each torque level
were used for training when patterns from different levels were used for testing. For the last column, 100 patterns from each torque
load level were used for training.


Table 5

Percent correct classification of the MLP, RBFN, and the neurofuzzy classifier, trained by 100% torque level and tested by different
torque levels

Classifier types (trained with 100% torque level)   MLP                   RBFN              Neurofuzzy
Learning type                                       Off-line              Off-line          On-line
Learning time                                       414 epochs, 21 min    1 epoch, 4 min    1 epoch, 2 min
Test level (torque level) (%)
  100                                               98.75                 100               100
  80                                                58.71                 16.57             62.14
  75                                                61.14                 0.57              71.71
  70                                                37.43                 0                 46.43

From Table 5, it is found that the neurofuzzy and RBF networks achieved
100% correct classification for the 100% torque level, while the MLP
network achieved only 98.75%. For generalization to different torque
levels, i.e. the 80, 75, and 70% torque levels, the neurofuzzy
classifier shows better performance than the MLP and RBF networks.


5.  Conclusions 

A novel neurofuzzy network for pattern classification has been
proposed. The neurofuzzy classifier, based on neural networks and fuzzy
set theory, is a self-organized classifier. The network automatically
deduces fuzzy if-then rules from the numerical data. The proposed
classifier synergistically integrates the standard fuzzy inference
system and a one-pass supervised learning concept of neural networks.
Gaussian radial basis functions are used as the membership functions in
both the antecedent and the consequent parts of the fuzzy inference
system. The neurofuzzy classifier learns training patterns on-line and
incrementally with only one pass.

The network has been validated using two benchmark data sets: Fisher’s Iris data set and
the Westland vibration data set. The results show that, on the Iris data set, the proposed
classifier classifies the testing data with 97.33% correct classification. For the Westland
vibration data, the neurofuzzy classifier achieved 100% correct classification when all
torque levels were used for training and testing. It is worth mentioning that the proposed
classifier is equipped with an on-line, one-pass, incremental learning rule. As a result, it
is suitable for applications that must detect new fault scenarios on-line, such as machine
condition-based monitoring, and for any application that prefers a short training time with
classification performance comparable or even superior to conventional approaches.

To improve the neurofuzzy classifier, future work should pursue the following. First, the
proposed classifier is sensitive to the order of presentation of the training patterns
(i.e., different orders of presentation result in different rules and classification
performance). It is preferable to use fewer rules with higher classification performance.
To select a better rule set, a number of trials may be needed, each presenting the training
data in a different random order. The rule set that has fewer rules and generalizes better
on the test set is then selected. This is not computationally expensive, since each trial
needs only a single quick training iteration.
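
The trial-and-select procedure described above can be sketched as follows. This is only an illustration under stated assumptions: `train_one_pass` is a hypothetical toy one-pass learner standing in for the neurofuzzy classifier (it is not the authors' learning rule), the data are synthetic, and `radius`, `n_trials`, and `seed` are made-up parameters. Only the selection logic (highest held-out accuracy, then fewest rules) follows the text.

```python
import numpy as np

def train_one_pass(X, y, radius=1.0):
    """Toy order-sensitive learner (hypothetical): a pattern becomes a new rule
    (prototype) unless a prototype of its own class already lies within `radius`."""
    protos, labels = [], []
    for x, t in zip(X, y):
        covered = any(l == t and np.linalg.norm(x - c) <= radius
                      for c, l in zip(protos, labels))
        if not covered:
            protos.append(x)
            labels.append(t)
    return np.array(protos), np.array(labels)

def accuracy(model, X, y):
    """Fraction of patterns whose nearest prototype carries the right label."""
    protos, labels = model
    dists = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
    return float(np.mean(labels[np.argmin(dists, axis=1)] == y))

def select_rule_set(X_tr, y_tr, X_te, y_te, n_trials=10, seed=0):
    """Train once per random presentation order and keep the best rule set."""
    rng = np.random.default_rng(seed)
    results = []
    for _ in range(n_trials):
        order = rng.permutation(len(X_tr))          # new presentation order
        model = train_one_pass(X_tr[order], y_tr[order])
        results.append((accuracy(model, X_te, y_te), len(model[0]), model))
    # Highest accuracy first; among ties, the fewest rules.
    best = max(results, key=lambda r: (r[0], -r[1]))
    return best, results

# Synthetic two-class data, used only to exercise the selection loop.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(3.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
idx = rng.permutation(80)
X, y = X[idx], y[idx]
best, results = select_rule_set(X[:60], y[:60], X[60:], y[60:])
```

Because every trial is a single pass over the training data, the cost of the whole search grows only linearly with the number of trials.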

Second, if the network seems to have too many rules, a rule-pruning algorithm should be
incorporated into the network to reduce the number of rules. Also, the initial standard
deviation for the membership function is chosen heuristically and is usually set to a small
value such as 0.1. There should be a mechanism that automatically determines the best choice
of the initial standard deviation. Moreover, there is no mathematical proof of convergence
for the proposed neurofuzzy network. (Note that the network uses a one-pass, incremental
learning algorithm, so a mathematical proof of convergence is not easy to derive.) Finally,
using different membership functions in both the antecedents and the consequents, different
“AND” and “OR” operators, and different defuzzification methods may improve the
classification performance. These tasks are left for future research.

Abstract: An innovative neuro-fuzzy network appropriate for fault detection and
classification in a machinery condition health monitoring environment is proposed. The
network, called an incremental learning fuzzy neural (ILFN) network, uses localized neurons
to represent the distributions of the input space and is trained using a one-pass, on-line,
and incremental learning algorithm that is fast and can operate in real time. The ILFN
network employs a hybrid supervised and unsupervised learning scheme to generate its
prototypes. The network is a self-organized structure with the ability to adaptively learn
new classes of failure modes and update its parameters continuously while monitoring a
system. To demonstrate the feasibility and effectiveness of the proposed network, numerical
simulations have been performed using some well-known benchmark data sets, such as Fisher’s
Iris data and the Deterding vowel data set. Comparison studies with other well-known
classifiers were performed, and the ILFN network was found competitive with or even superior
to many existing classifiers. The ILFN network was applied to the vibration data known as
the Westland data set, collected from a U.S. Navy CH-46E helicopter test stand, in order to
assess its efficiency in machinery condition health monitoring. Using a simple fast Fourier
transform (FFT) technique for feature extraction, the ILFN network has shown promising
results. With various torque levels for training the network, 100% correct classification
was achieved for the same torque levels of the test data.

Index Terms: Fuzzy neural network, incremental learning, machine health monitoring, pattern
classification.

I. Introduction

Modern engineering technology is leading to increasingly complex machinery with ever more
demanding performance criteria. A constant need to prolong service life and improve
manufacturing yields for global competition calls for an even higher standard of structural
reliability. At the other extreme, the maintenance and sustainment of aging
capital-intensive infrastructures demand innovative technology in condition-based
maintenance. A downsized workforce, a declining maintenance budget, and a desire for a
“better, cheaper, and smarter” solution have further complicated risk management decisions.
However, currently used diagnostic systems that rely primarily on ingenious sensor
innovations or healthy redundant sensor placements to provide early warning are costly,
vulnerable, and computationally expensive to validate. Specialized nondestructive testing
equipment and procedures are often local in nature, heavy, passive, and labor-intensive.
State-of-the-art on-board vibration monitoring systems (VMSs) in sophisticated defense
vehicles and civic structures serve to collect only vibration spectra or acoustic emissions
for off-line analysis. Needless to say, time-critical decisions due to catastrophic failures
are often left unresolved. To increase the ability to promptly detect and predict impending
failures and catastrophic breakdowns in the complex interrelated structures of plants,
vehicles, and processes, a fundamental health usage monitoring (HUM) system using newly
emerging computational intelligence technologies was proposed [1]. This system was based
solely upon the nondestructive analysis of vibration signatures and acoustic emissions. This
model-free integrated approach, advanced from frequency analysis and learning theory,
provided analytical redundancy with respect to the conventional fault detection,
identification, and accommodation (FDIA) methodologies that mandate an established
high-fidelity dynamic model. The decisions made were then interpreted in order to facilitate
expert maintenance procedures for emergency services as well as routine checkups [2].

In the machinery health monitoring system proposed in [1], pattern classification is a key
component in identifying failure modes induced from the monitored systems. Service and
maintenance can be promptly and correctly performed if the pattern classifier makes an
accurate decision. While operating, mechanical components generate some physical parameters,
such as temperature or pressure variations, electromagnetic fields, acoustic emissions, or
vibration spectra, which contain information about the state of the machinery health. These
physical behaviors are sensed by a transducer array to obtain data that are used for failure
diagnosis and prognosis. Using pattern classification techniques, signatures (like natural
frequency, mode shape, or curvature shape) can be extracted from the data that contain
information about machine defects and their causes. With the accurate decision making of a
classifier in a monitoring system, machine maintenance can be performed before catastrophic
failures occur.

When a machine is operating properly, the physical symptoms, such as vibrations, are
generally small and constant. However, when faults develop that lead to significant
variations in process dynamics, the physical signatures (e.g., power spectrum density,
natural frequency, and mode shape) also vary. To detect these changes, classical off-line
iterative learning classifiers are proposed to supervise the monitored system [1], [3], [4].
These classifiers have a drawback in that they generally require a long training time. In
addition, gradient-type classifiers often get stuck at local minima and are unable to
achieve an optimum solution. Furthermore, it is possible that novel failure modes evolve
while a monitored system is running. These faults could be significantly different from
those known to the classifier. These new classes of defects need to be promptly detected and
distinguished from those on which the classifier has been trained. Often, the monitored
system generates multiple defects simultaneously. Identification of these multiple defects
is also needed in order to perform correct actions for maintenance. Conventional statistical
or neural classifiers share known deficiencies in coping with the problems listed above
[5]-[8]. In this paper, we propose an effective fuzzy neural network capable of solving the
above problems and appropriate for a condition-based health monitoring system. The proposed
network, advanced from the fuzzy ARTMAP architecture [9], incorporates Gaussian membership
functions and provides continuous health monitoring based upon vibration signatures.

For the completeness of the presentation, the remainder of this paper is organized as
follows. Section II provides a brief literature survey for pattern classification based on
various approaches. The comprehensive understanding of the problem statement and the
deficiencies of the existing literature then leads to the clearly defined objectives of this
study given in Section III. Then, the proposed network architecture and the learning rule
are introduced in Section IV. To demonstrate the effectiveness and efficiency of the
proposed network, numerical simulations and benchmark comparisons are presented in
Section V. Finally, Section VI provides some concluding remarks and future research
directions.


II. Literature Review

Pattern classification forms a fundamental solution to different problems in real-world
applications. The function of pattern classification is to categorize an unknown pattern
into a distinct class based upon a suitable similarity measure. Thus, similar patterns are
designated to be in the same class while dissimilar patterns are classified into different
classes. Engineers and scientists have developed various methodologies to deal with
classification problems. These conventional classification techniques make use of
statistical decision theory, neural network theory, and fuzzy set theory. Because of their
simplicity, statistical approaches, such as Bayesian classifiers [10], are still widely
used. To handle more complex classification problems, neural network classifiers, such as
the multilayer perceptron network (MLP) [11], the learning vector quantization (LVQ) network
[12], and the radial basis function network (RBFN) [13]-[16], with the abilities of parallel
computation and nonlinear mapping, have been shown to be more suitable because of their
learning and generalization abilities. A third classification technique, incorporating fuzzy
set theory [17] to handle vague information, has been extensively applied to pattern
classification. The main advantage of fuzzy classification techniques, such as
fuzzy-rule-base methods [18], fuzzy c-means [19], fuzzy k-nearest-neighbor [20], [21], and
fuzzy decision tree [22], lies in the fact that they provide a soft decision, which is a
value that describes the degree to which a pattern fits with a class.

The synergetic integration of neural networks and fuzzy sets is also an active area for
pattern classification research. A growing number of researchers have designed and examined
various forms of fuzzy-neural or neuro-fuzzy networks. The idea is to merge the capabilities
of model-free and trainable systems, parallel computation, and noise tolerance of neural
networks with the ability of fuzzy set theory to deal with imprecise situations. The
integration of neural networks and fuzzy set theory results in a classifier that has useful
properties of both neural networks and fuzzy sets. The combination of neural networks and
fuzzy sets forms a hybrid network that handles pattern classification problems very
effectively and efficiently. Because of their massively parallel computational units, neural
networks have the advantage of fast computation, so that it is possible to process real-time
estimation of extensive information. The benefit of fuzzy systems lies in their ability to
handle the unclear data usually experienced in real-world problems [17]. Fuzzy neural
networks have been shown to be very advantageous in dealing with realistic problems. Some
examples of fuzzy neural networks and neural-fuzzy systems for pattern classification
problems are the knowledge-based fuzzy MLP [23], the neural-network-based fuzzy classifier
[24], the neuro-fuzzy system [25], the adaptive neural fuzzy inference system [26], the
on-line self-constructing neural fuzzy inference network (SONFIN) [27], the fuzzy min-max
neural network [28], the fuzzy ART neural network [29], the fuzzy ARTMAP neural network [9],
the Gaussian ARTMAP neural network [30], and the RBF fuzzy ARTMAP neural network [31].

While the other networks cannot learn incrementally, the fuzzy ARTMAP learns new knowledge
without having to relearn all the patterns. This concept is appropriate for detecting new
faults in on-line machine health monitoring.

The main concept of the fuzzy ARTMAP is that input patterns are presented to fuzzy ART_a to
be clustered into groups, while the corresponding targets are presented to fuzzy ART_b, also
to be clustered into groups. (Fuzzy ART_a and fuzzy ART_b are defined in the fuzzy ARTMAP
architecture [9].) The two modules are then mapped to correct input and output pairs via a
map field module. The fuzzy ARTMAP learns to classify inputs by a fuzzy set of features, or
a pattern of fuzzy membership values between 0 and 1. A hyperbox is used to represent a
decision boundary of the input space. Its minimum point and maximum point define a hyperbox
fuzzy set. A membership function is defined with respect to the hyperbox minimum and maximum
points in each dimension. Extensive details of the fuzzy ARTMAP neural network are discussed
in [9]. Despite the beneficial property of on-line incremental learning, some drawbacks of
the fuzzy ARTMAP neural network presented in the literature are as follows:

1) It has no mechanism to avoid overfitting and hence should not be used with noisy data.

2) In the fuzzy ART system, full membership functions are allowed to overlap for each
hyperbox, leading to confusion in decision making (for example, a pattern can have full
membership in two classes at the same time).

3) It has too many parameters to tune in the network.
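
Drawback 2 can be made concrete with a small sketch. The hyperbox membership below is a deliberately crisp simplification (1 inside the box, 0 outside) and is not the fuzzy ARTMAP choice function; the Gaussian form exp(-||p - c||^2 / sigma^2), the box corners, the centers, and sigma are all illustrative assumptions.

```python
import numpy as np

def hyperbox_membership(p, v_min, v_max):
    """Crisp simplification: full membership (1.0) anywhere inside the box."""
    return float(np.all((p >= v_min) & (p <= v_max)))

def gaussian_membership(p, center, sigma):
    """Assumed Gaussian form: full membership only at the center itself."""
    return float(np.exp(-np.sum((p - center) ** 2) / sigma ** 2))

p = np.array([1.5, 1.2])                      # pattern in the overlap region

# Two overlapping hyperboxes, one per class (hypothetical corners).
box_a = (np.array([0.0, 0.0]), np.array([2.0, 2.0]))
box_b = (np.array([1.0, 1.0]), np.array([3.0, 3.0]))
hb_a = hyperbox_membership(p, *box_a)
hb_b = hyperbox_membership(p, *box_b)         # both are 1.0: ambiguous decision

# Gaussian prototypes placed at the box centers give graded, distinct values.
g_a = gaussian_membership(p, np.array([1.0, 1.0]), sigma=1.0)
g_b = gaussian_membership(p, np.array([2.0, 2.0]), sigma=1.0)
```

With the hyperboxes the pattern has full membership in both classes, whereas the Gaussian memberships differ and can be ranked.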

To address the above deficiencies, we propose a new network intended for condition
monitoring. The details of the specific objectives are given in Section III.
III. Objectives of the Research

The primary objective of this study is to develop a methodology for pattern classification
appropriate for machinery condition-based health monitoring applications. The proposed new
classifier, advanced from the concepts of the fuzzy ARTMAP concerning on-line incremental
learning, is called an incremental learning fuzzy neural (ILFN) network. To overcome some of
the known deficiencies of statistical classifiers, neural classifiers, fuzzy classifiers,
and fuzzy-neural network classifiers, the ILFN classifier incorporates the following
features.

1) A Hybrid Supervised and Unsupervised Learning Algorithm: A supervised learning algorithm
[9], [11]-[14], [32] is used when the corresponding targets are known. On the other hand,
when the corresponding target is not available, an unsupervised learning algorithm [12],
[16], [19], [20], [28]-[32] is adopted for on-line learning.

2) Fast, On-Line, One-Pass, Incremental Learning: Many well-known neural networks and
conventional pattern classification techniques use off-line (or batch) learning, which
assumes all training patterns and their targets are given. On the other hand, on-line
learning needs only one training pattern at a time. Thus, on-line learning requires less
memory than off-line learning does. Off-line learning also tends to need a longer training
time. A one-pass learning algorithm has training patterns presented to the classifier only
once. Incremental learning is the capability of learning new classes and quickly refining
existing classes without forgetting learned information.

3) Ability to Detect New Classes and Label Them Differently from the Existing Corresponding
Targets: In some machinery health monitoring systems, such as VMSs, unanticipated fault
patterns may develop while the systems are operating. These new patterns need to be promptly
detected and learned by the classifiers in order to prescribe a correct action for
maintenance. After training, traditional classifiers cannot detect the difference between
the learned fault patterns and unseen fault patterns. They often assign the new patterns to
the closest learned patterns even when they are significantly different. This may lead to a
misunderstanding and cause incorrect service.

4) Ability to Build Decision Boundaries that Separate Nonlinearly Separable Problems: Many
neural classifiers have overcome nonlinearly separable problems. This new classifier should
also provide the ability to build decision boundaries that separate both linearly and
nonlinearly separable classes.

5) Ability to Make Decision Boundaries of All Overlapping Classes: Bayesian classifiers are
generally used to classify overlapping classes; however, constructing a Bayesian classifier
requires knowledge of the probability density function for the classes. Even if the
probability density function for each class is unavailable beforehand, the proposed
classifier should be able to classify overlapping classes by employing Gaussian neurons with
adaptable variances.

6) Nonparametric Classifier: Parametric classifiers need a priori information about the
probability density functions of an input pattern space; on the other hand, nonparametric
classifiers do not require such a priori information [28].

Fig. 1. Network architecture of the ILFN classifier in the supervised learning mode.

7) Ability to Provide Both Soft and Hard Classification Decisions: A hard decision means
that a given pattern either belongs to or does not belong to a specific class prototype. In
contrast, a soft decision allows a given pattern to belong to more than one class prototype
with different membership grades [28]. It is possible to detect multiple defects in
monitored systems if a soft decision is used.

8) Few Tuning Parameters: Tuning parameters are used for controlling a learning process, and
there should be as few parameters as possible to tune in the system [28].
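
The memory argument behind feature 2 can be illustrated with a running-mean update, which refines a prototype from one pattern at a time without storing earlier patterns. This is a minimal sketch and not the ILFN learning rule itself; the class name `RunningPrototype` and the sample data are hypothetical.

```python
import numpy as np

class RunningPrototype:
    """Keeps a cluster mean up to date one pattern at a time."""

    def __init__(self, first_pattern):
        self.mean = np.asarray(first_pattern, dtype=float).copy()
        self.count = 1

    def update(self, pattern):
        # Incremental mean: no earlier pattern has to be stored or revisited.
        self.count += 1
        self.mean += (np.asarray(pattern, dtype=float) - self.mean) / self.count
        return self.mean


patterns = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])

proto = RunningPrototype(patterns[0])
for p in patterns[1:]:                  # a single pass over the stream
    proto.update(p)

batch_mean = patterns.mean(axis=0)      # what an off-line (batch) pass computes
```

After the single pass, `proto.mean` matches `batch_mean`, although only the current mean and a counter were ever held in memory.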

IV. Network Architecture and Classification Algorithm

The ILFN network is advanced from the fuzzy ARTMAP basic idea of on-line and incremental
learning behavior. The architecture of the ILFN network is similar to the fuzzy ARTMAP;
however, in detail, the ILFN network operations are completely different from those of the
fuzzy ARTMAP. While the fuzzy ARTMAP uses hyperbox membership functions, the ILFN network
employs Gaussian membership functions that can prevent full membership of overlapping
classes. The ILFN network architecture is detailed as follows.

The architecture of the ILFN network is distinguished by two different modes: a supervised
learning mode (as shown in Fig. 1) and an unsupervised learning mode (as shown in Fig. 2).
The two modes differ only in the controller module and the target labeling module. Whereas
the supervised learning mode requires pairs of input and target patterns to construct
prototypes of the network, the unsupervised learning mode uses the target labeling module to
assign the target class for a given input pattern.

The  ILFN  network  has  four  layers: 

1)  one  input  layer; 

2)  one  hidden  layer; 

3)  one  output  layer; 

4)  one  decision  layer,  as  shown  in  Figs.  1  and  2. 

Fig. 2. Network architecture of the ILFN classifier in the unsupervised learning mode.


Generally, the first three layers of the system are composed of two subsystems: an input
subsystem and a target subsystem. These subsystems are linked by three connections:

1) the controller module in the hidden layer;

2) the pruning modules in both the input subsystem and the target subsystem of the output
layer;

3) the decision layer, which is the link between the membership module in the input
subsystem and the target module in the target subsystem.

The following sections present the details of the input subsystem, the target subsystem, the
controller module, and the fourth layer, the decision layer.

A.  Input  Subsystem 

Fig. 3 illustrates the input subsystem of the ILFN classifier.
The input subsystem is composed of three parts:

1)  a  variable  p  in  the  input  layer; 

2) a Gaussian membership function with variable weight W_P in the hidden layer;

3)  a  pruning  module  and  a  membership  module  in  the  output 
layer. 

In the input layer, each element of an input vector p connects to one neuron. Neurons of the
input layer are fully connected to neurons of the hidden layer via a dynamic synaptic weight
matrix W_P, whose rows represent prototype vectors, which are the centroids of the radial
basis functions in the hidden layer. W_P is a trainable weight matrix, adjusted by learning
rules that will be discussed later. W_P grows when a new prototype is detected: a row is
added to W_P each time a neuron is added to the hidden layer.

Fig. 3. Input subsystem of the ILFN classifier.

In the hidden layer of the ILFN network, Gaussian membership functions are used. The
Gaussian functions are centered on the mean vectors of clusters, which are called prototypes
of the input pattern space. The Gaussian membership functions are employed to fuzzify the
input vectors p into membership values m_i with respect to the distance measure between the
input vectors p and the existing prototypes. The membership function at the ith neuron,
m_i(p, w_Pi), is defined by the following equation:

$$ m_i(\mathbf{p}, \mathbf{w}_{Pi}) = \exp\left(-\frac{\|\mathbf{p} - \mathbf{w}_{Pi}\|^{2}}{\sigma_i^{2}}\right) $$


where ||·|| denotes the Euclidean distance, which is used as a similarity measure between
two vectors. The weight vector between the input layer and the ith hidden neuron, w_Pi, is
the center or mean vector at the ith neuron in the hidden layer; σ_i represents the standard
deviation of the ith neuron in the hidden layer. The membership function m_i(p, w_Pi) of the
hidden layer is used to fuzzify the distance between a given input vector p and the ith
center w_Pi into a real value m_i, which represents the degree of similarity between p and
w_Pi. The membership functions produce localized, bounded, and radially symmetric kernels.
The value of these membership functions decreases monotonically as the distance from the
function’s center increases.
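
The hidden-layer computation can be sketched in a few lines of NumPy. The exponent's normalization is not legible in this copy, so the form exp(-||p - w_Pi||^2 / sigma_i^2) used below is an assumption; the function name, prototype values, and standard deviations are likewise illustrative only.

```python
import numpy as np

def gaussian_membership(p, W_P, sigma):
    """Membership of input p in each prototype (one row of W_P per prototype).

    Assumed form: m_i = exp(-||p - w_Pi||^2 / sigma_i^2).
    """
    p = np.asarray(p, dtype=float)
    W_P = np.atleast_2d(np.asarray(W_P, dtype=float))
    sigma = np.asarray(sigma, dtype=float)
    sq_dist = np.sum((W_P - p) ** 2, axis=1)    # squared Euclidean distances
    return np.exp(-sq_dist / sigma ** 2)

W_P = np.array([[0.0, 0.0], [1.0, 1.0]])        # two prototype centers
sigma = np.array([0.5, 0.5])                    # one standard deviation each
m = gaussian_membership([0.0, 0.0], W_P, sigma)
```

The input coincides with the first prototype, so its membership there is exactly 1, and the membership in the second prototype is much smaller, as expected for a localized, radially symmetric kernel.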

In the ILFN network, a class may have several prototypes. These prototypes have different
degrees of belonging assigned to a pattern. Only the prototype with the highest degree of
belonging is needed to represent a pattern. The prototypes with lower degrees of belonging
generate redundant classes. To eliminate redundant classes, the pruning module is used in
the output layer of the ILFN network. Instead of passing many duplicated classes, only
distinct classes are passed to the membership module. This makes the output of the system
easier for human users to interpret.

The pruning procedure of the ILFN network is different from usual pruning procedures that
eliminate insignificant neurons or weights [23]. The pruning module used in the ILFN network
is a short-term memory. Thus, it operates separately for each input pattern.

In addition, the pruning module in the input subsystem and the pruning module of the target
subsystem work in the same way. Among the prototypes of each class, the maximum membership
value in the input subsystem is selected to represent the degree of similarity with respect
to that class in the target subsystem.
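
A minimal sketch of this pruning step follows, assuming that each hidden neuron carries an integer class label; the function name `prune` and the sample values are hypothetical. Because the module is a short-term memory, the function would be called afresh for every input pattern.

```python
import numpy as np

def prune(memberships, prototype_targets):
    """Keep, for each distinct class, only its highest membership value."""
    best = {}
    for m, target in zip(memberships, prototype_targets):
        if target not in best or m > best[target]:
            best[target] = float(m)
    classes = sorted(best)
    return classes, np.array([best[c] for c in classes])

# Four prototypes: two belong to class 1 and two to class 2.
memberships = np.array([0.10, 0.70, 0.40, 0.65])
prototype_targets = [1, 1, 2, 2]

classes, pruned = prune(memberships, prototype_targets)
```

The two lists share the same index order, which is what lets the decision layer pair each surviving membership value with its class.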

The membership module in the output layer of the input subsystem receives information
transmitted from the pruning module and passes it to the decision layer. The information
stored in the membership module is a short-term memory, which means that the information in
the membership module differs for different input vectors. Each membership value in the
membership module indicates the degree of similarity of an input vector with respect to the
target classes. The membership values are then mapped, in the same order of indexes, to
classes in the target module in the target subsystem via the decision layer.

Fig. 4. Target subsystem of the ILFN classifier.


B. Target Subsystem


The target subsystem of the ILFN classifier is depicted in Fig. 4. Each neuron of the input
layer in the target subsystem is fully connected to each element of a target vector. A
synaptic weight matrix W_T, which needs no training, is used to connect the neurons of the
input layer to the neurons of the hidden layer. An additional row is added to W_T when a
neuron is added to the hidden layer. These hidden neurons of the target subsystem are
activated by linear functions.

As in the input subsystem, the pruning module of the output layer in the target subsystem is
used to eliminate redundant classes in the hidden layer. Instead of passing many duplicate
classes, only classes with prototypes that have the highest degree of membership for a given
input are passed to the membership module. As mentioned before, the pruning module in the
target subsystem works the same way as the pruning module in the input subsystem does.

The  target  module,  which  is  in  the  output  layer  of  the  target 
subsystem,  receives  information  passed  from  the  pruning 
module  and  submits  it  to  the  decision  layer.  Each  neuron  of  the 
target  module  is  a  class  or  a  target  of  an  input  vector.  The  target 
module  is  a  short-term  memory  as  is  the  membership  module 
of  the  input  subsystem.  In  the  same  order  of  indexes,  the  target 
module  is  then  mapped  to  the  membership  module  of  the  input 
subsystem  via  the  decision  layer. 


winning neuron is smaller than the threshold ε and the desired
target and the decision output are the same.

In the unsupervised learning mode, the controller module of the ILFN classifier has only one
component, which is a comparator. The comparator is used to compare the winning membership
value in the hidden layer to the threshold ε. The output of this comparator becomes “true”
if the winning membership value is smaller than ε. If the output of the comparator is
“true,” meaning that a new category is detected, the system adds a new neuron to the hidden
layer using the input pattern as the new prototype; the target labeling module then assigns
a distinct corresponding target to the new prototype.

In addition to a comparator, the controller module in the unsupervised learning mode also
has a target labeling module used to assign a target for a new prototype. The target
labeling module receives one input from the output of the controller module in the hidden
layer of the target subsystem. This input from the controller module tells the target
labeling module to assign a target when a new neuron is added to the system. Another input
of the target labeling module, representing the targets of the prototypes, is used to check
the existing targets in order to assign a new target that differs from the existing targets.

D .  Decision  Layer 

The  decision  layer  is  used  to  map  the  membership  values  in 
the  membership  module  of  the  input  subsystem  to  the  target 
classes  in  the  target  module  of  the  target  subsystem.  The  output 
from  the  decision  layer  is  the  output  of  the  system.  The  decision 
output  can  be  interpreted  as  a  soft  decision  or  a  hard  decision. 
For  the  soft  decision,  the  decision  output  assigns  different  mem¬ 
bership  values  to  the  pattern  classes  or  prototypes.  This  allows 
a  given  pattern  to  belong  to  more  than  one  class  with  different 
degrees  of  similarity',  measure.  For  the  hard  decision,  the  deci¬ 
sion  output  selects  only  one  class  with  the  highest  membership 
value. 


C.  Controller  Module 

The  controller  module  is  used  to  control  the  growth  of  neu¬ 
rons  in  the  hidden  layers  of  both  the  input  subsystem  and  the 
target  subsystem.  It  operates  differently  in  the  controller  module 
in  supervised  learning  mode  and  unsupervised  learning  mode. 

In  the  supervised  learning  mode,  there  are  three  components 
in  the  controller  module:  two  comparators  and  one  AND  gate. 
One  comparator  is  used  to  compare  the  winning  membership 
value  from  the  hidden  layer  of  the  input  subsystem  to  the 
threshold  t.  The  output  of  this  comparator  becomes  “true” 
if  the  winning  membership  value  is  smaller  than  the  e.  This 
implies  that  the  input  vector  is  significantly  different  from  all 
existing  prototype  vectors.  The  output  is  sent  to  one  input  of  the 
AND  gate.  Another  comparator,  which  has  two  inputs,  is  used 
to  compare  the  desired  target  to  the  predicted  output  which  is 
stored  in  the  hidden  layer  of  the  target  subsystem.  The  output 
of  the  comparator  becomes  “true"  if  both  the  desired  target  and 
the  predicted  output  are  the  same.  It  is  sent  to  another  input 
of  the  AND  gate.  If  both  inputs  of  the  AND  gate  are  “true," 
its  output  becomes  ‘true."  This  allows  the  system  to  add  one 
more  neuron  to  the  hidden  units.  In  other  words,  the  system 
generates  more  neurons  whenever  the  membership  value  of  the 


E.  ILFN  System  Dynamics 

Both W_P and W_T are allowed to grow when the system
detects new classes. However, only W_P can adaptively change
its information or learn new prototypes. At the initialized state,
there are no neurons in the hidden layer. The first neuron in the
hidden layer is set up after the first input vector p is presented to
the input subsystem of the network while the first target vector t
is presented to the input layer in the target subsystem. Then, both
W_P and W_T set up the first neuron using p and t, respectively.
The next input vector is compared to the existing prototype. If
there is a significant difference (depending on the threshold ε),
then a new neuron is added to the hidden layer; p is added to
W_P, and t is added to W_T. On the other hand, if the input
vector meets the similarity criterion then, instead of adding a
new neuron, the learning process is performed. W_P and the
other parameters are updated to include the new data in the
existing prototypes.

F.  Learning  Process 

Fig. 5. Decision boundaries among prototypes of the ILFN classifier.

The learning process takes place only in the hidden layer. It
adapts the synaptic weight W_P and updates the variables
regarding the pattern clusters of the input space. In the learning
process, each input vector p from the input space is fuzzified to a
membership value at each node of the hidden layer with respect
to the distance measure between input vector p and the synaptic
weight matrix W_P. The winning node of the hidden layer is
determined by the defuzzification process using the fuzzy OR
operation (∨) defined as

    \text{winner} = m_1 \vee m_2 \vee \cdots \vee m_r,    (2)

    J = \text{winner index} = \arg\max_i (m_i)    (3)

where m_1 ∨ m_2 = m_1 if m_1 > m_2; m_1 ∨ m_2 = m_2 if m_1 < m_2.
Only the parameters of the winner node (i.e., the Jth neuron),
including count, mean, and standard deviation, are updated, while
the other losing nodes remain the same, as follows:


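    C_{J,\mathrm{new}} = C_{J,\mathrm{old}} + 1    (4)

    w_{PJ,\mathrm{new}} = \frac{w_{PJ,\mathrm{old}}\,(C_{J,\mathrm{new}} - 1) + p}{C_{J,\mathrm{new}}}    (5)

    \sigma_{J,\mathrm{new}} = \begin{cases} \ldots\ \frac{1}{C_{J,\mathrm{new}}}\,(w_{PJ,\mathrm{new}} - p)^2\ \ldots \\ \ldots & \text{otherwise} \end{cases}    (6)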


where a parameter with the subscript “old” represents that
parameter before updating and a parameter with the subscript
“new” represents that parameter after updating. C_J represents
the number of patterns that have been counted into the Jth
prototype. The mean w_PJ, the center, or the Jth prototype, is a
row in the synaptic weight W_P. The standard deviation σ_J will
be used to indicate the spread of the data in the Jth prototype.
σ_0 is the initial standard deviation representing the isotropic
spread in pattern space of a new category for the first sample.
σ_0 is usually chosen small enough (e.g., a value between 0.001
and 0.05) to include only the pattern that is set up for the new
prototype. After the patterns near the prototype are included
in the same prototype, the standard deviation σ_J is updated
accordingly.

Equations  (4)-(6)  are  learning  rules  used  to  update  the  pro¬ 
totype  variables  in  the  input  subsystem.  The  number  of  patterns 
belonging  to  each  cluster  is  updated  by  (4).  By  knowing  the  pre¬ 
vious  centers  and  the  number  of  patterns  that  belong  to  a  cluster, 
new  centers  can  be  calculated  by  (5).  The  estimated  standard 
deviations  [30]  can  be  calculated  if  the  previous  standard  devi¬ 
ation  and  the  number  of  the  patterns  belonging  to  a  cluster  are 
known.  Estimated  standard  deviations,  which  are  the  spread  of 
the  Gaussian  membership  functions,  are  determined  by  (6). 
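
The winner selection of (2)-(3) and the incremental updates of (4)-(5) can be illustrated with a minimal Python sketch. Several details here are assumptions rather than the method of this paper: the membership is taken to be exp(-d^2 / (2 sigma^2)), each node is stored as a dictionary with hypothetical keys (`count`, `mean`, `m2`, `sigma`, `sigma0`), and the spread is updated with Welford's running-variance recurrence pooled over the features, floored at sigma_0, as a stand-in for (6).

```python
import math


def gaussian_membership(p, center, sigma):
    """Membership of pattern p in one prototype (assumed Gaussian form)."""
    d2 = sum((x - c) ** 2 for x, c in zip(p, center))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def winner_index(memberships):
    """Fuzzy OR (max) over the memberships, as in (2)-(3)."""
    return max(range(len(memberships)), key=lambda i: memberships[i])


def update_winner(node, p):
    """Update count (4), mean (5), and spread of the winning node in place.

    The spread uses Welford's recurrence as a stand-in for (6); it is an
    assumption of this sketch, not the exact formula of the paper.
    """
    node["count"] += 1                                   # rule (4)
    c = node["count"]
    old_mean = node["mean"]
    new_mean = [(m * (c - 1) + x) / c                    # rule (5)
                for m, x in zip(old_mean, p)]
    # Accumulate squared deviations, summed over all features.
    node["m2"] += sum((x - a) * (x - b)
                      for x, a, b in zip(p, old_mean, new_mean))
    node["mean"] = new_mean
    variance = node["m2"] / ((c - 1) * len(p))           # isotropic estimate
    node["sigma"] = max(math.sqrt(variance), node["sigma0"])


# A node created from the first sample (0, 0), then updated with (2, 0).
node = {"count": 1, "mean": [0.0, 0.0], "m2": 0.0,
        "sigma": 0.01, "sigma0": 0.01}
update_winner(node, [2.0, 0.0])
print(node["count"], node["mean"], node["sigma"])
```

Only the winning node is passed to `update_winner`; all other nodes keep their count, mean, and spread, which matches the statement that losing nodes remain the same.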


G.  Decision  Boundaries 

The  purpose  of  pattern  classification  is  to  determine  to  what 
class  a  given  sample  belongs.  Through  an  observation  or  mea¬ 
surement  process,  a  set  of  numbers  which  make  up  the  obser¬ 
vation  vector  is  obtained.  The  observation  vector  serves  as  the 
input to a decision rule by which the sample is assigned to one
of the given classes.


The decision boundaries of the ILFN network distinguish
among prototypes in the Voronoi tessellation [12]. Each
prototype has its own region separated by the decision boundaries.
Since the ILFN classifier uses Gaussian-type membership
functions with different standard deviations, the soft decision
boundaries of the ILFN classifier are quadratic. However, the hard
decision boundary between the neighboring prototype vectors is
a hyperplane containing the points that have the same degree
of the membership value, as shown in Fig. 5. Fig. 5 shows the
decision boundaries among prototypes of the ILFN network, in
which dotted circles indicate the spread of statistical data for
each prototype.

H.  Classification  Algorithm 

The  ILFN  network  can  learn  in  two  different  ways:  1)  su¬ 
pervised  learning,  which  requires  both  input  patterns  and  corre¬ 
sponding  targets  and  2)  unsupervised  learning,  which  requires 
only  input  patterns  without  corresponding  targets  and  in  which 
the  target  labeling  module  will  assign  appropriate  class  labels. 
The  classification  algorithm  of  the  ILFN  classifier  is  listed  as 
follows: 

Step 1) Set the user-defined threshold parameter (ε), the initial
standard deviation σ_0, and the maximum number of patterns
allowed in each cluster.

Step 2) Retrieve the first input pattern.
- Use the first input pattern p to set up the first prototype
(or mean) in W_P.
- Set the number of patterns for the first node to be 1.
- Set the standard deviation equal to the initial standard
deviation, σ_0.
- Set a new neuron in W_T using the first target t to be
the corresponding target of the prototype in W_P.

Step 3) Retrieve the next training sample with an input and
target.

Step 4) Measure the Euclidean distance between the input p
and the prototypes in W_P.

Step 5) Calculate membership values for each node using
the Gaussian-type radial basis function.

Step 6) Assign membership values to each node. The current
pattern has different degrees of membership for each node or
prototype. For each class, select the maximum membership
value from each prototype to represent the degree of
similarity with respect to that class.

Step 7) Identify the largest membership using the fuzzy OR
operator.

Step 8) For the winning node:

If there is a corresponding target (i.e., supervised
learning mode):

a) If the winner is larger than ε and the target t
is the same value as W_T of the winner node,
then update the weight W_P, the standard deviation,
and the number of patterns belonging to this node.

b) If a) is not satisfied, then:
- Set a new node center for W_P using the input pattern p.
- Set the number of patterns for the new node to be 1.
- Set the initial standard deviation to the new node.
- Add a new neuron to W_T using the new target t as the
corresponding target of a new prototype in W_P.

If there is no corresponding target (i.e., unsupervised
learning mode):

a) If the winner is larger than ε and the number
of patterns is less than the maximum number
of allowed patterns, then update the weight
W_P, the standard deviation, and the number
of patterns belonging to this node. Identify the
class output, which is stored in W_T at the same
index as the winner node of W_P.

b) If the winner is smaller than ε, then:
- Set a new node center for W_P using the input pattern p.
- Set the number of patterns for the new node to be 1.
- Set the initial standard deviation to the new node.
- Add a new neuron to W_T and assign a new target as
the corresponding target of a new prototype in W_P.
(The assigned new target must be significantly
different from the existing targets already stored in W_T.
For example, if there exist targets in W_T = [1 2 3]^T,
the new target should be “4,” that is, W_T becomes
[1 2 3 4]^T.)

Step 9) If there are no more input patterns, then stop.
Otherwise, go to Step 3.
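
Steps 1-9 can be sketched as a small Python class. This is an illustrative outline under stated assumptions, not the authors' MATLAB implementation: the class name `ILFNSketch`, its method names, and its default parameter values are hypothetical; the membership is assumed to be exp(-d^2 / (2 sigma^2)); the spread update uses Welford's running variance pooled over features and floored at sigma_0 in place of (6); unsupervised labels are assumed to be integers, with a new label chosen as the largest existing label plus one; and when the winner exceeds ε but its cluster is already full, the sketch simply reports the stored class without updating, a case the steps above leave open.

```python
import math


class ILFNSketch:
    """Illustrative one-pass incremental classifier following Steps 1-9."""

    def __init__(self, eps=0.5, sigma0=0.05, max_count=1000):
        self.eps = eps              # threshold on the winning membership
        self.sigma0 = sigma0        # initial spread of a new prototype
        self.max_count = max_count  # cap on patterns per cluster
        self.centers, self.sigmas = [], []   # prototype side (W_P)
        self.counts, self.m2 = [], []        # count and squared-deviation sum
        self.targets = []                    # target side (W_T)

    def memberships(self, p):
        """Steps 4-5: squared Euclidean distance mapped to a Gaussian value."""
        out = []
        for c, s in zip(self.centers, self.sigmas):
            d2 = sum((x - y) ** 2 for x, y in zip(p, c))
            out.append(math.exp(-d2 / (2.0 * s * s)))
        return out

    def _winner(self, p):
        """Step 7: fuzzy OR (max) over all node memberships."""
        m = self.memberships(p)
        j = max(range(len(m)), key=m.__getitem__)
        return j, m[j]

    def _add_node(self, p, t):
        if t is None:
            # Unsupervised: pick an integer label not used so far (assumed).
            t = max(self.targets, default=0) + 1
        self.centers.append([float(x) for x in p])
        self.sigmas.append(self.sigma0)
        self.counts.append(1)
        self.m2.append(0.0)
        self.targets.append(t)
        return t

    def _update_node(self, j, p):
        self.counts[j] += 1
        c = self.counts[j]
        old = self.centers[j]
        new = [(m * (c - 1) + x) / c for m, x in zip(old, p)]
        # Welford-style running variance (stand-in for the spread rule).
        self.m2[j] += sum((x - a) * (x - b) for x, a, b in zip(p, old, new))
        self.centers[j] = new
        var = self.m2[j] / ((c - 1) * len(p))
        self.sigmas[j] = max(math.sqrt(var), self.sigma0)

    def learn(self, p, t=None):
        """Steps 2-8 for one sample; t=None selects the unsupervised branch."""
        if not self.centers:
            return self._add_node(p, t)
        j, mj = self._winner(p)
        if t is not None:
            if mj > self.eps and self.targets[j] == t:
                self._update_node(j, p)
                return t
            return self._add_node(p, t)
        if mj > self.eps:
            if self.counts[j] < self.max_count:
                self._update_node(j, p)
            return self.targets[j]
        return self._add_node(p, None)

    def predict(self, p):
        """Hard decision: target of the node with the largest membership."""
        j, _ = self._winner(p)
        return self.targets[j]

    def soft_decision(self, p):
        """Soft decision: largest membership per class (Step 6)."""
        best = {}
        for m, t in zip(self.memberships(p), self.targets):
            best[t] = max(best.get(t, 0.0), m)
        return best


clf = ILFNSketch(eps=0.5, sigma0=0.5)
for pattern, target in [([0.0, 0.0], 1), ([0.1, 0.0], 1), ([5.0, 5.0], 2)]:
    clf.learn(pattern, target)
print(len(clf.centers), clf.predict([4.8, 5.1]), clf.learn([10.0, 10.0]))
```

In this usage the second sample is absorbed into the first prototype, the third sample creates a second prototype, and the final unlabeled sample is far from both prototypes, so the unsupervised branch creates a third prototype with a new label.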

Usually,  if  the  user  knows  both  input  patterns  and  their  tar¬ 
gets,  the  network  is  trained  in  the  supervised  learning  mode. 
After  supervised  training,  the  network  is  used  in  a  pattern  clas¬ 
sification  system.  The  ILFN  network  can  detect  new  categories 
that  have  not  been  trained.  When  the  system  detects  new  cate¬ 
gories,  it  employs  the  unsupervised  learning  mode  by  using  the 
target  labeling  module  to  assign  the  corresponding  targets  to  the 
input  patterns.  The  targets  that  are  assigned  to  the  novel  proto¬ 
types  are  significantly  different  from  the  existing  targets  in  the 
target  module. 

V.  Simulations  and  Results 

To  demonstrate  the  performance  of  the  ILFN  classifier,  nu¬ 
merical  simulations  were  used  in  our  experiments.  The  simula¬ 
tion  programs  were  written  to  run  under  MATLAB  on  a  Pentium 
233MMX  PC.  Three  data  sets  were  used  for  training  and  testing 
the  classifier  in  our  studies.  The  first  benchmark  data  set  was  the 
well-known  Fisher’s  Iris  data  set  [33].  The  second  data  set  was 
a vowel data set [34]. The vowel data set is electronically
available from the connectionist benchmark collection at Carnegie
Mellon University, Pittsburgh, PA [35]. For the first two data
sets used in this study, the results have shown that the ILFN
classifier is capable of learning on-line in real time in only one pass
through all training data. In addition, the prediction capability
of the ILFN classifier was found to be as good as or even better
in many cases than many existing classifiers. With the ability
of fast, one-pass, on-line, real-time, incremental learning, the
ILFN network is found to be applicable in real-world
applications. The last and most important data set was a time-series
vibration data set known as Westland vibration data [36]. The
details of the three experiments are as follows.

Fig. 6. (a) Scatter plot of sepal width and length features of the Fisher’s iris
data. (b) Scatter plot of petal width and length features of the Fisher’s iris data.

A.  Fisher’s  Iris  Flower  Data  Set 

The  Fisher's  Iris  flower  data  set  consists  of  150  patterns  and 
four  features: 

1)  sepal  length; 

2)  sepal  width; 

3)  petal  length; 

4)  petal  width. 

The four features describe the shape and size of the irises. Each
pattern in the data set falls into one of three classes:

1) Setosa;

2) Versicolor;

3) Virginica;

with a total of 50 patterns per class. For the purpose of this
experiment, we will call them Class 1, Class 2, and Class 3,
respectively. Class 1 is linearly separable from the other two.
However, Class 2 and Class 3 are not linearly separable from
each other.

TABLE I

COMPARISON PERFORMANCE BETWEEN THE ILFN CLASSIFIER AND FUZZY ARTMAP

Fig.  6(a)  shows  the  scatter  plot  of  Iris  data  for  sepal  width 
and  length  features.  It  is  worth  noting  from  the  plot  that  Class 
1  can  be  easily  separated  from  Class  2  and  Class  3.  However, 
Classes  2  and  3  seem  very  difficult  to  separate  since  there  is  an 
overlap between them. Moreover, in Fig. 6(b), the petal width
and length features are plotted, showing that Class 1 is very well
separated from Class 2 and Class 3. Again, Class 2 and Class 3
remain overlapped [33].

1) A Comparison Between the ILFN Classifier and the Fuzzy
ARTMAP: In this study, the training data set was composed of
the first 25 patterns of each class, while the testing data set was
composed of the remaining 25 patterns of each class. Twenty
trials were performed. For each trial, the presentation order of
the training data was randomly selected. To compare the
performance of the ILFN network with a similarly supervised on-line
incremental learning classifier, the fuzzy ARTMAP neural
network was used. The ILFN classifier and the fuzzy ARTMAP
were trained with the same training data set. Then, both
networks were tested for generalization using the same testing data.
For the parameters of the ILFN classifier, the initial standard
deviation σ_0 was set to 0.001 and the threshold ε was set between
0 and 1. When given the same order of presentation of this data
set, the ILFN network yields the same number of hidden
neurons and the same classification performance independent of the
ε chosen. The parameters of the fuzzy ARTMAP were set as
follows: the vigilance parameters ρ_a = 0.5 and ρ_b = 0.5, and
the learning rate β = 1. The results of the study are shown in
Table I.

From Table I, using the testing data, the ILFN achieved a
maximum of 98.67%, an average of 96.268%, and a minimum
of 93.33% correct classification. On the other hand, the fuzzy
ARTMAP classifier achieved a maximum of 96%, an average
of 93.467%, and a minimum of 90.67% correct classification.
Moreover, the ILFN classifier used only one-iteration learning
through all training data while the fuzzy ARTMAP used one
to four iterations to learn the training patterns. However, both
algorithms used training times within only a few seconds. For
this data set, the number of nodes of the ILFN classifier was not
sensitive to the threshold value ε; thus, different values of ε
(between 0 and 1) yielded the same number of hidden neurons and
the same performance of correct classification. For the fuzzy
ARTMAP, the number of hidden neurons was very sensitive to
the choices of the vigilance parameters ρ_a and ρ_b.
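
As a quick arithmetic check, the percentages reported for the ILFN classifier can be converted back into counts of misclassified test patterns. The short sketch below assumes the test set described above (the remaining 25 patterns of each of the three classes, 75 in total); the helper name `misclassified` is hypothetical.

```python
def misclassified(accuracy_percent, n_test):
    """Number of wrong predictions implied by a percent-correct figure."""
    return n_test - round(accuracy_percent / 100.0 * n_test)


N_TEST = 3 * 25  # 25 held-out patterns for each of the three classes

best = misclassified(98.67, N_TEST)   # best ILFN trial
worst = misclassified(93.33, N_TEST)  # worst ILFN trial
print(best, "to", worst, "errors out of", N_TEST)  # 1 to 5 errors out of 75
```

The maximum and minimum accuracies therefore correspond to one and five misclassified patterns, respectively, out of 75.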

2) Comparisons Among Other Classifiers: Table II shows
the classification performance among other classifiers using the
Fisher Iris data. The classifiers in rows one to six, reported by
Simpson [28], show that most of the classifiers were able to
predict the testing data with between two and four incorrect
classifications. (See details in [28] on how to construct the training
and the testing data for these experiments.) It is worth
mentioning that those classifiers, except the fuzzy min-max
classifier, cannot learn on-line. The fuzzy min-max classifier, which
is an unsupervised algorithm, uses hyperboxes for representing
the input distribution. On the other hand, the ILFN classifier
uses the localized Gaussian function, which is more appropriate
to represent the distribution of data space [30]. The summary
results of the ILFN network and the fuzzy ARTMAP in this study
are also included in the last two rows of Table II.

B.  Vowel  Recognition  Data  Set 

For the Deterding vowel recognition data [34], four male and
four female speakers were used for training, and an additional
four male and three female speakers were used for testing. The
data set is in a ten-dimensional input space with 528 samples for
the training set and 462 samples for the testing set.

Due to the nature of the vowel data set, which is extremely
difficult to separate, a wide range of threshold values was used.
In our study, which uses different thresholds ranging from 10^-1
down to 10^-20, the ILFN classifier generated hidden neurons
ranging in number from 127 down to 78, as shown in Table III.
Larger thresholds allowed the classifier to create more neurons
than smaller thresholds. However, the larger number of neurons
in the hidden layer does not imply a better performance in
predicting the testing data.

Table III shows the ILFN classifier performance on vowel
data using different values of the threshold ε. The classifier
performed on the test data in various percentages of correct
prediction. When using the threshold of 10^-1, 127 hidden nodes were
generated and the correct prediction of testing data was only
52.81%. When using the threshold of 10^-4, only 101 hidden
nodes were generated and the correct prediction of testing data
was improved to 54.98%. With a still smaller threshold, the ILFN
classifier achieved its best generalization of 57.36% with 90
hidden nodes. When the threshold was reduced further, the
percentage of correct prediction decreased gracefully. The proposed
ILFN classifier was trained in one pass through all data with an
average training time of less than 2 s.

The vowel classification results using various nonlinear
classifiers are shown in Table IV. The comparison study was performed
by Robinson [37]. In Robinson’s study, the best result, a
correct prediction rate of 56%, was obtained using the nearest
neighbor classifier. The ILFN classifier achieved 57.36%.

C.  Westland  Vibration  Data  Set 

This data set consists of vibration data recorded using eight
accelerometers mounted on different locations on the aft main
power transmission of a U.S. Navy CH-46E helicopter. The
CH-46E Sea Knight is a twin-rotor, fore/aft transmission rotorcraft
powered by two turbine engines. The data set was archived at
the Applied Research Laboratory (ARL) of Penn State
University. The vibration data were collected with an International
Recording Instruments Group analog tape recorder from a
single mixbox and aft main transmission installed on a test stand
and run at nine different torque levels (i.e., 100%, 80%, 75%,
70%, 60%, 50%, 45%, 40%, and 27%). While collecting the
data, only one faulted component was installed in the mixbox
and transmission. Vibration data were recorded in this way for
the types of faults listed in Table V. Employing a ten-channel data
acquisition system, the data were then digitized at a sample rate
of 103 116.08 Hz with a 16-bit quantization level and saved in
1.506-MB data files. Altogether, there are 71 files; each file
contains all eight accelerometer signals. The data files used in
this study were 1-s data files [36].

1)  Westland  Data  Characteristics:  Fig.  7(a)  and  (b)  shows 
two  samples  of  vibration  data  in  the  time  domain  pertaining  to 
Fault  Class  2  and  Class  3  from  Accelerometer  1  of  the  West- 
land  Data  Archive.  As  clearly  seen,  it  is  difficult  to  discrimi¬ 
nate  the  two  raw  time-series  data.  Since  the  raw  time  series  data 
provide  little  information  to  use  for  classification,  it  is  prefer¬ 
able  to  transform  the  signal  from  time  domain  to  frequency  do¬ 
main.  The  vibration  signatures  in  frequency  domain  are  shown 
in  Fig.  8(a)  and  (b),  which  are  power  spectral  density  plots  of 


TABLE II

Comparison of Performance among Existing Classifiers
for IRIS Data

Technique            No. Wrong             Remarks
Bayes classifier*    2                     Very amenable to this type of data.
k-nearest neighbor*                        Scales up poorly.
Fuzzy k-NN*                                Allows fuzzy labels for data points.
Perceptron*          3                     Limited to linear discrimination.
Fuzzy perceptron*    2                     Fuzzifies linear boundaries.
Fuzzy min-max*       2                     Single-pass learning, learns on-line (hyperbox distribution).
Fuzzy ARTMAP         2-6 (run 20 trials)   Learns on-line with 1 to 4 passes in less than one second. Uses hyperbox distribution.
ILFN classifier      1-5 (run 20 trials)   One-pass on-line learning in less than one second. Uses Gaussian distribution.

* According to Simpson in [28]

TABLE III

ILFN Classifier Performance on Vowel Data Using Different
Values of the Threshold

TABLE IV

Vowel Classification with Different Nonlinear Classifiers [37]

Classifier               No. of hidden units   No. correct   Percent correct
Single-layer             N/A                   154           33
Multi-layer              88                    234           51
Multi-layer              22                    206           45
Multi-layer              11                    203           44
Modified Kanerva         528                   231           50
Modified Kanerva         88                    197           43
Radial basis function    528                   247           53
Radial basis function    88                    220           48
Gaussian node network    528                   252           55
Gaussian node network    88                    247           53
Gaussian node network    22                    250           54
Gaussian node network    11                    211           47
Square node network      88                    253           55
Square node network      22                    236           51
Square node network      11                    217           50
Nearest neighbor         N/A                   260           56
ILFN                     90                    265           57.36

the  two  signals  given  in  Fig.  7(a)  and  (b),  respectively.  It  is  easy 
to  see  that  frequency  contents  above  20  kHz  are  less  useful.  The 
effective  information  for  classification  is  in  the  frequency  range 
of  3  kHz-10  kHz.  For  the  frequency  range  of  0-12  kHz,  Fig.  9(a) 
and  (b)  provide  a  “zoom-in”  version  of  the  power  spectrum  den¬ 
sity  plot  shown  in  Fig.  8(a)  and  (b),  respectively.  More  sample 
plots  in  the  frequency  domain  of  100%  torque  level  of  the  West- 
land  vibration  data  are  shown  in  Fig.  10.  Fig.  10  shows  sample 
patterns  of  Fault  2,  Fault  3,  Fault  4,  Fault  5,  Fault  6,  Fault  7, 
Fault  8,  and  No  Fault  from  all  eight  accelerometers.  It  is  worth 


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART B: CYBERNETICS, VOL. 31, NO. 4, AUGUST


TABLE V

LIST OF THE FAULT TYPES CREATED IN THE TEST GEARBOX

Description

3

5

6

7

8

9

piovcuc Plane: urearaore/.oeanng/innerkace Common Mailing

Spiral Bevel Input Pinion Bearing Journal Corrosion Pitting/Spalling

Spiral Bevel Input Pinion Gear Tooth Spalling/Scuffing

Collector Gear Crack Propagation

Quill Shaft Crack Propagation

No Defect

Fig. 7. (a) Plot of time series data of Fault 2 from Sensor 1. (b) Plot of time
series data of Fault 3 from Sensor 1.


noting that data from each sensor alone is not sufficient to
classify the fault classes. Moreover, it is easier to classify the data
by using all patterns obtained from the eight sensors. Integrating
fault patterns from all eight accelerometers is more informative
for classification. In this study, most of our experiments used the
combined signatures from all eight sensors as training patterns.

2) Using ILFN Classifier on Westland Vibration Data: In
our experiments, vibration time-series data was preprocessed
using the fast Fourier transform (FFT) technique to transform
it from the time domain to the frequency domain. A Hanning
window of 1024 samples was utilized. We filtered the data with
the frequency band of interest, 3 kHz-10 kHz, getting a 141 × 1
vector for each channel. Vectors from the eight channels were
set into one vector (1128 × 1 vector). Then, they were used as
training data for the ILFN classifier as well as the other
classifiers used in this study. The fault types and torque levels of the
Westland vibration data used in the experiments are shown in
Tables VI and VII. The summary results from the experiments
are given in Table VIII.
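
The preprocessing chain described above (Hanning window, FFT, band selection, concatenation across channels) can be sketched in plain Python as follows. This is a hedged illustration, not the original preprocessing code: the function names `fft`, `band_features`, and `combine_channels` are hypothetical, the squared FFT magnitude is assumed as the spectral feature, and the FFT length is taken equal to the 1024-sample window. With these assumptions the number of bins per channel follows from the sample rate and FFT length, so it need not match the 141 × 1 vector reported in the text.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def band_features(signal, fs, f_lo=3000.0, f_hi=10000.0):
    """Hanning-windowed power spectrum restricted to [f_lo, f_hi] Hz."""
    n = len(signal)
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1))
              for i in range(n)]
    spectrum = fft([s * w for s, w in zip(signal, window)])
    feats = []
    for k in range(n // 2 + 1):
        f = k * fs / n                      # centre frequency of bin k
        if f_lo <= f <= f_hi:
            feats.append(abs(spectrum[k]) ** 2)
    return feats


def combine_channels(channels, fs):
    """Concatenate the band features of all sensor channels into one vector."""
    vec = []
    for ch in channels:
        vec.extend(band_features(ch, fs))
    return vec


FS = 103116.08  # sample rate quoted for the Westland data, in Hz
N = 1024        # window length quoted in the text
# Synthetic 5-kHz tone standing in for one accelerometer channel.
tone = [math.sin(2.0 * math.pi * 5000.0 * i / FS) for i in range(N)]
features = combine_channels([tone] * 8, FS)
print(len(band_features(tone, FS)), len(features))
```

Concatenating the per-channel vectors mirrors the way the eight accelerometer signatures are combined into a single training pattern.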


Table VIII shows the results from all simulations in our
studies. In the simulations of the Westland data set, all torque
load levels (i.e., 100%, 80%, 75%, 70%, 60%, 50%, 45%,
40%, and 27%) were used to train the ILFN classifier. Only
ten patterns were used for training, and the remaining patterns
were used for testing in each torque level. All patterns were
used for training when different torque load levels were used
for testing. For the last column of Table VIII, the training set
was composed of the first ten patterns of each torque level.
The threshold value was selected between 0 and 0.7 and the
initial standard deviation selected was 0.001. This yielded the
same number of hidden neurons and the same classification
performance.

In Table VIII, the columns represent the “training data” with
different torque levels, and the rows indicate the “testing data”
with different torque levels. The percent of correct classification
results are interpreted by crossing each column with each row.
For instance, 100% correct classification was achieved when
the ILFN network was trained by the 40% torque level and was
tested by the 50% torque level. The numbers of hidden neurons
resulting from the training process of the ILFN classifier are
shown in Table VIII in the row “Hidden nodes.” These numbers
indicate how many prototypes the ILFN classifier has created.

Fig. 9. (a) Power spectrum density plot of Fault 2 from Sensor 1. (b) Power
spectrum density plot of Fault 3 from Sensor 1.

Fig. 10. Power spectrum density plot of 100% torque load with different faults
from eight sensors.

Using ten patterns of the 100% torque level for training,
the classifier created eight neurons in the hidden layer. The
ILFN classifier obtained 100% correct classification using the
remaining data of the same torque load for testing. Moreover,
using all 400 patterns of the 100% torque level for training, the
ILFN network also created eight neurons in the hidden layer.
The other torque levels were used to test the performance of
the ILFN network. The correct classification of 71.43% was
achieved for the 80% torque load; for the 75% torque load,
81.42% correct prediction was obtained. The ILFN classifier
yielded the correct classification of 57.14%, 41.2%, 37.2%,
42.5%, 4%, and 7.4% for torque levels of 70%, 60%, 50%,
45%, 40%, and 27%, respectively.

It is worth noticing that when the same torque level was used
both for training and testing, the ILFN classifier achieved 100%
correct classification. (Note that testing patterns were different
from the training patterns, i.e., obtained from different time
series, but they were in the same torque level.) Furthermore,
using high torque levels (i.e., 100% and 80%) for training,
in the testing phase, the ILFN classifier achieved perfect
classification only when the testing patterns from the same
torque level were used. However, using the 75%, 70%, 60%, 50%,
45%, 40%, or 27% torque level for training, the ILFN network
was able to correctly classify a larger range of torque levels.
For example, when the ILFN classifier was trained by using
a 50% torque level, 100% correct classification was obtained
from the range of 40% through 50% torque levels.

3) Comparisons Among Other Classifiers: More experimental
results on the Westland vibration data are shown in
Table IX, which compares the MLP, RBFN,
LVQ, and ILFN classifiers. This simulation was performed using
200 patterns of the 100% torque level to train each classifier. The
testing data sets were composed of the remaining 200 patterns
from the 100% torque load, 700 patterns from 80% torque, 350
patterns from 75% torque, and 700 patterns from 70% torque
load. The data used were 1128-dimensional vectors that were
combined from all eight sensors. CPU usage time was averaged
over 50 runs.

The first network was the MLP trained by backpropagation
with variable learning rates. The MLP was comprised of
one hidden layer with ten hidden nodes and one output layer
with four nodes. (Output targets are labeled in binary form.)
Logsigmoidal functions were utilized in the MLP network. The
sum of square error (SSE) goal was set to 0.001. The MLP was
trained for 50 trials. To meet the SSE goal, the MLP network
used a training time of 475 iterations with 400 s on the
average of 50 runs. We noticed that for many trials, the MLP was
stuck at some local minima, unable to converge to the specified
goal. The second network was the RBF network. Using one-pass
self-selection of the hidden centers by a successive approximation
method [38], the RBFN constructed eight hidden neurons
in the hidden layer. Then, the output weight was determined
using the method proposed by Haykin [14]. The RBFN quickly
learned within a single iteration. The third network was the LVQ
network. The network was composed of an eight-neuron LVQ
layer and a four-neuron linear layer. The maximum training time
of the LVQ was set to be 500 iterations. The LVQ used
approximately 194 s for training on the average of 50 runs.
The ILFN classifier incrementally learned and generated
eight neurons in the hidden layer. The training time was about
4 s within a single iteration. On this data set, the ILFN network
used a training time approximately 100 times, three times,
and 64 times faster than the MLP, the RBFN, and the LVQ,
respectively. For generalization capacity, it was shown that
the ILFN classifier was competitive with the LVQ. In Table IX,
both the ILFN classifier and the LVQ were able to classify the
100%-torque-load testing data with 100% correct classification.
The percent correct classification of the two networks was
reduced to 71%, 74%, and 57% when using 80%, 75%, and
70% torque load for testing, respectively. However, considering
the capability of on-line, real-time, incremental learning, the
ILFN network was superior to the LVQ. Moreover, based on
generalization and fast on-line learning ability, the ILFN
classifier was superior to the MLP and to the RBFN off-line learning
algorithms. This result emphasizes that the ILFN classifier is
suitable for machinery condition health monitoring systems,
which require a fast on-line real-time learning algorithm.

TABLE VI

Fault Types and Torque Levels of Westland Vibration Data Used in the Experiments

TABLE VII

Torque Levels and the Number of Patterns Used in the Experiments

Torque levels   100%   80%   75%   70%   60%   50%   45%   40%   27%
# patterns      400    700   350   700   500   500   400   500   500

4) ILFN Learning Simulation in an Unsupervised Learning
Mode: To study the ability of the ILFN classifier in an
unsupervised learning mode, which is usually used when the ILFN
network is monitoring a system, we used 100% torque load to train
the ILFN network. First, only a “No Fault” class was trained to
the network. Acting as a monitoring system, the ILFN classifier
received unseen patterns in order to classify them.
The ILFN network was able to detect new classes. It learned the
incoming faults by creating new neurons and designating new
targets for the unseen patterns that were significantly different
from the patterns that had been learned.

Table X illustrates the performance of the ILFN network in an
unsupervised learning mode. In Table X, first the ILFN classifier
was trained using class 1 with the corresponding target
“0000.” Then, patterns from Fault 8, Fault 5, Fault 2, Fault 3,
Fault 6, Fault 7, and Fault 4 were presented to the ILFN network
without corresponding targets. The ILFN network assigned
targets to be “0001,” “0010,” “0011,” “0100,” “0101,” “0110,” and
“0111,” respectively. In order to make a new target differ from the
existing targets, the ILFN classifier first checked the existing
targets, finding the highest number in the target module. Then,
incrementing the highest number by one, the ILFN
classifier assigned the new target to the incoming pattern.
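
The target assignment rule described above can be illustrated with a minimal sketch. It assumes that targets are stored as fixed-width binary strings; the function name `next_target` and the `width` parameter are hypothetical and are not taken from the ILFN implementation, and the detection of a new class itself is not modeled here.

```python
def next_target(existing_targets, width=4):
    """Return a new binary target one above the highest existing target.

    existing_targets: list of binary strings such as "0000" (assumed format).
    width: number of bits in a target (hypothetical parameter).
    """
    highest = max(int(t, 2) for t in existing_targets)
    new_value = highest + 1
    if new_value >= 2 ** width:
        raise ValueError("no unused target left for this width")
    return format(new_value, f"0{width}b")


# "No Fault" is learned first with the target "0000".
targets = ["0000"]

# Seven unseen fault classes arrive one after another without targets.
for _ in range(7):
    targets.append(next_target(targets))

print(targets)
```

Under these assumptions, the eight stored targets run from “0000” to “0111,” matching the order of assignment reported in the text.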


VI.  Conclusions 

A new algorithm based on fuzzy neural networks, called the
incremental learning fuzzy neural (ILFN) network, has been
developed for pattern classification. The ILFN classifier
employs a hybrid supervised and unsupervised learning scheme to
generate its prototypes. The network is a self-organized
classifier with the capability of adaptively learning new information
without forgetting existing information. The classifier can
detect new classes and update its parameters while monitoring
a system. Moreover, it utilizes fast real-time on-line learning
without knowing a priori information. In addition, it has the
capability to make both soft (fuzzy) and hard (crisp) decisions
and is able to classify both linearly separable and nonlinearly
separable problems.

The  network  is  a  synergetic  integration  of  fuzzy  sets  and 
neural  networks.  It  employs  the  fast  parallel  computation  and 
learning  capability  of  neural  networks.  In  addition,  fuzzy  set 
theory  adds  the  ability  to  represent  and  manipulate  imprecise 
information. 

Three benchmark data sets (Fisher’s Iris data set, the
Deterding vowel data set, and the Westland vibration data set)
were used in simulations to demonstrate the performance of
the ILFN classifier. Comparisons between the ILFN classifier
and some existing methods were made. The results show that,
in terms of classification performance, the ILFN classifier is
competitive with or even better than many well-known
classifiers, including the MLP, the RBFN, the LVQ, and the Fuzzy
ARTMAP classifier. Additionally, in terms of training time, the
ILFN network is superior to classical classifiers. Furthermore,
the on-line, real-time, one-pass, incremental learning behavior
allows the ILFN network to detect new classes and update its
parameters without using old data to retrain the network. The ILFN
classifier, acting as a component in a monitoring system, was
used extensively to investigate the Westland vibration data. The
results from the simulation studies have shown that the real-time
and on-line ILFN classifier is efficient for fault classification
and identification in machine condition monitoring.

Several qualitative issues of the ILFN classifier remain to be
investigated in the near future. The important issues include:
1) convergence analysis of the incremental learning rule;

TABLE VIII

Percent Correct Classification of the ILFN Classifier for the Westland Vibration Data with Different Torque
Levels for Training and Testing


Testing data      Training data (torque levels)
(torque levels)   100%   80%   75%   70%   60%   50%   45%   40%   27%
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100 

60.57 

35 

36  | 

40  | 

40  | 
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40 
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80% 

71.43 

100 
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59.71 
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32.33 
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32.33 
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71.43 
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39 
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100 
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j  48.6 
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100 

92.6

Remark: (1) Only 10 patterns were used for training when the same torque load was used for testing.

(2)  All  patterns  were  used  for  training  when  a  different  torque  load  was  used  for  testing. 

(3)  For  the  last  column,  only  10  patterns  from  each  torque  load  level  were  used  for  training. 


TABLE IX

Percent Correct Classification of the MLP, RBFN, LVQ, and
ILFN Network, Trained by 100% Torque Level, and Tested by
Different Torque Levels

Testing data      Classifier types (trained with 100% torque level)
(torque level)    MLP           RBFN          LVQ           ILFN
100%              96.5          [illegible]   100           100
80%               [illegible]   4.57          71.43         71.43
75%               61.14         0.57          74.29         [illegible]
70%               [illegible]   [illegible]   [illegible]   [illegible]
Learning type     OFF-LINE      OFF-LINE      OFF-LINE      ON-LINE
Learning time     400 s         12 s          194 s         4 s
                  475 epochs    1 epoch       500 epochs    1 epoch

TABLE X

The ILFN Classifier Assigned Classes to the Unseen Patterns

                 Faults      Labeled classes
Learned fault    No fault    0 0 0 0
Unseen faults    Fault 8     0 0 0 1
                 Fault 5     0 0 1 0
                 Fault 2     0 0 1 1
                 Fault 3     0 1 0 0
                 Fault 6     0 1 0 1
                 Fault 7     0 1 1 0
                 Fault 4     0 1 1 1

2) parameter survey of the initial standard deviation $\sigma_0$ and
the threshold $\varepsilon$;

3) generalization performance of the ILFN classifier with
respect to the number of hidden neurons.
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Abstract  -  A  hierarchical  architecture  that  combines  a  high  degree  of 
reconfigurability  and  long-term  memory  is  proposed  as  a  fault  tolerant 
control  algorithm  for  complex  nonlinear  systems.  Dual  Heuristic 
Programming (DHP) is used for adapting to faults as they occur for the
first time in an effort to prevent the build up of a general failure, and
also as a tuning device after switching to a known scenario. A dynamical
database, initialized with as much information of the plant as available,
oversees the DHP controller. The decisions of which models to record,
when to intervene and where to switch are autonomously taken based on
specifically designed quality indexes. The results of the application of the
proposed algorithm to a proof-of-the-concept numerical example help to
illustrate the fine interrelations between each of its subsystems.

Index Terms - Fault tolerant control, fault detection and isolation,
neural networks, adaptive critic, multiple models.

I.  Introduction 

Increased performance requirements are often achieved at the
cost of plant and control simplicity. As overall complexity rises, so
does the chance of occurrence, diversity and severity of faults.
Therefore, availability, defined as the probability that a system or
equipment will operate satisfactorily and effectively at any point of
time [1], becomes a factor of great importance. For automated
production processes, for example, availability is now considered to
be the single factor with the highest impact on profitability [2]. Fault
Tolerant Control (FTC) is a field of research that emerges to
increase availability and reduce the risk of safety hazards by
specifically designing control algorithms capable of maintaining
stability and performance despite the occurrence of faults [3,4].

As complex systems suffer from faults, the original model
parameters, or even their own dynamic structure, may change in a
multitude of unpredictable ways. Even if the system has a
satisfactory linearization around the nominal operating point,
nonlinearities may become of paramount importance after a fault
occurs [5]. Since complex systems pose a challenge even in the
design of models under nominal conditions, the task of devising
nonlinear high order models off-line for all known fault scenarios
can be a daunting one. When the stochastic nature of faults is taken
into consideration, and even knowledge of all fault
scenarios becomes impossible, it is clear that the
problem of interest to FTC cannot be dealt with without on-line
nonlinear adaptive control strategies. In the proposed architecture,
Dual Heuristic Programming (DHP), an Adaptive Critic Design
(ACD), was chosen as the reconfigurable controller due to its known
effectiveness in noisy, nonlinear environments while
making minimal assumptions regarding the nature of that
environment [6].

To our best knowledge, the application of the DHP
reconfigurable controller represents one of the most effective ways
to deal with the unexpected dynamics that a plant may assume after
the occurrence of a fault. However, as an FTC scheme by itself, the
use of a reconfigurable controller such as DHP presents two main
limitations. The first one arises from the fact that solutions to a set
of expected fault scenarios are often available and may involve the
application of very specific control laws. A reconfigurable controller
alone, however, does not provide any mechanism through which
knowledge available during design time can be incorporated. The
second limitation arises from the known tradeoff between adaptation
and long-term memory. As the reconfigurable controller provides
faster convergence to a wider range of control solutions, it fails to
retain the control laws designed for previously
encountered scenarios.

To overcome both limitations, a novel supervisor system oversees
the DHP controller in the architecture displayed in Figure 1. The
Identifier and Controller Dynamical Database (ICDD), located
inside the supervisor, contains the knowledge available during design
time, as well as solutions devised online for unexpected fault
scenarios. The decisions of when to intervene by switching to a
control solution and when to add a new identifier and
controller pair to the database are taken by the supervisor based on
the current fault scenario. Such information is extracted by the
scenario recognition module, which makes use of specifically
designed quality indexes capable not only of performing Fault
Detection and Identification (FDI), but also of producing indispensable
information on the evolution of a fault through time. The synergetic
combination of the superior adaptation capabilities of the DHP
controller with the fault information and long-term memory
provided by the multiple model structure [7] of the proposed
supervisor generates an advanced FTC scheme capable of dealing with
a diversified collection of actuator and component faults.


The remainder of the letter is organized as follows. In Section II,
the problem is stated along with a description of the goal of the FTC
solution, followed by the presentation of the reconfigurable
controller in Section III. In Section IV, the proposed supervisor
is presented in detail, and its interaction with the controller is
discussed. Finally, Section V brings a numerical example
through which some of the key features of the proposed FTC are
illustrated.

II.  Problem  Statement 

This work focuses on complex nonlinear systems upon
which a control law was applied to make them capable of
performing their mission under nominal operational conditions. The
plant under nominal conditions is thereby considered to be stable
and with the desirable degree of performance. Due to the generic
nature of the considered faults, no guarantee of stability in a fault
scenario is required of the nominal control law.

The work focuses on actuator and component faults, leaving
sensor faults to be identified and recovered in parallel by any of the
currently available methods, such as sensor fusion [8,9] and
specialized filters [10]. For a recovery to be at all possible, the
required redundancy (hardware or analytical) is supposed to exist in
the system. From the theoretical point of view, this statement
matches the requirement for sustained observability and
controllability through fault scenarios.

As stated in the IFAC - SAFEPROCESS terminology definition
[3], the goal of fault tolerant control is to maintain control
objectives, although a loss of performance may be accepted. Even
though speed is not an imperative issue, the system must converge
in a finite time to a stable trajectory as close to the desired one as
possible under the effects of certain faults. Stability during transient
phases is also a key feature to achieve a safe and efficient fault
accommodation scheme. Trajectory tracking and time to recover are
both quantitative measures that can be used to evaluate the
performance of FTC methods.

Noise and low intensity disturbances are always present in real
world applications, requiring robust control solutions. Stronger
disturbances and component aging are examples of situations that
may cause significant changes in the system. In the proposed
scheme, those changes may be reflected in the variation of parameters of the
identified model, requiring the controller to be self-adaptive. In
extreme cases, FTC requires the controller to completely restructure
itself to cope with drastic changes in the dynamics of the plant.

III.  Reconfigurable  Controller 
Using  Adaptive  Critic  Design 

To achieve the required degrees of reconfiguration and
stability, the adaptive controller can benefit greatly if more than the
simple instantaneous difference between desired and actual states is
taken into account. Due to the
interaction between the controller and the plant, however, the quality
of a certain control strategy can only be fully measured after
analyzing all future effects it has on the control mission, which in
our case is trajectory tracking. Such a measure of control quality is
translated into the Hamilton-Jacobi-Bellman Equation (1), which
defines the cost-to-go $J(t)$ [11].


IV.  Supervisor 

The structure and the functionality of the supervisory system
can be divided into three layers. The first layer collects and
analyzes data from the plant and the controller block. The second
one is responsible for the decision making, while the final layer
devises ways to implement the resolutions.

The first layer receives the sampled output of the plant $R(t)$ and
a delayed input $u(t-1)$, computes two quality indexes, and indicates
the known scenario that better approximates the current dynamics.
The first quality index, $q_c(t)$, measures the reconfigurable
controller performance by performing a decaying sum of the
primary utility function of the active controller, as shown in Equation (2),

$$q_c(t) = \int_0^t \xi_c^{(t-\tau)}\, U(\tau)\, d\tau, \tag{2}$$

where $0 < \xi_c < 1$ is a time decay factor.
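
A discrete-time reading of this decaying sum may help fix ideas. The sketch below assumes a sampled utility sequence and replaces the integral by a sum, which gives the recursion q[t] = xi * q[t-1] + U[t]; the function name `decaying_sum` and the sample values are hypothetical and serve only as an illustration.

```python
def decaying_sum(utilities, xi):
    """Discrete analogue of a decaying sum of the utility function.

    Computes q[t] = sum over tau <= t of xi**(t - tau) * U[tau]
    through the equivalent recursion q[t] = xi * q[t-1] + U[t].
    """
    if not 0.0 < xi < 1.0:
        raise ValueError("xi must lie strictly between 0 and 1")
    q = 0.0
    history = []
    for u in utilities:
        q = xi * q + u  # older utilities are discounted once more per step
        history.append(q)
    return history


# Hypothetical squared tracking errors: a short burst of poor tracking.
U = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
q_c = decaying_sum(U, xi=0.5)
print(q_c)
```

With this choice, the index rises while tracking is poor and decays geometrically once tracking recovers, so a single threshold on the index reacts to sustained degradation rather than to isolated samples.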

For the calculation of the second quality index, the delayed input
is fed to the model database. The database contains a copy of
the identification, action and critic networks used to control the plant
under nominal operation, as well as copies for each known fault
scenario. Each one of the identification networks in the database is
then used to generate an identification error. For each one of those, a
decaying sum similar to (2) is used, and the results are compared. As
shown in (3), the smallest identification error history defines $q_i(t)$,
the identification quality index, and the model $d$ that generates it is
appointed as the switching candidate.
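
The selection of the switching candidate can be sketched in discrete time as follows. The sketch assumes scalar plant outputs and uses the absolute identification error inside the decaying sum; the error measure, the function name `identification_index`, and the model names are assumptions made for illustration only.

```python
def identification_index(plant_outputs, model_outputs, xi):
    """Return (q_i, candidate) from decayed identification-error histories.

    plant_outputs: list of scalar plant outputs R(t).
    model_outputs: dict mapping a model name d to its predictions R_d(t).
    xi: time decay factor, 0 < xi < 1.
    """
    scores = {}
    for name, predictions in model_outputs.items():
        q = 0.0
        for r, r_d in zip(plant_outputs, predictions):
            q = xi * q + abs(r_d - r)  # decayed absolute identification error
        scores[name] = q
    candidate = min(scores, key=scores.get)  # model with the smallest history
    return scores[candidate], candidate


# Hypothetical data: the plant currently behaves like the nominal model.
R = [1.0, 2.0, 3.0]
database = {
    "nominal": [1.0, 2.0, 3.0],
    "fault_1": [0.0, 0.0, 0.0],
}
q_i, candidate = identification_index(R, database, xi=0.5)
print(q_i, candidate)
```

A low value of the returned index indicates that some stored identifier already reproduces the current input-output behavior, and the associated model name is the one to which switching would be directed.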


$$q_i(t) = \min_{d \in D} \int_0^t \xi_i^{(t-\tau)} \left\| R_d(\tau) - R(\tau) \right\| d\tau, \tag{3}$$

where $0 < \xi_i < 1$ is a time decay factor and $R_d(t)$ is the output
vector of the identifier $d$, which in turn is an element of
the set of identifiers in the database $D$.

$$J(t) = \sum_{k=0}^{\infty} \gamma^k\, U(t+k), \tag{1}$$

where $\gamma$ is a discount factor for finite horizon ($0 < \gamma < 1$), and
$U(t)$ is the (primary) utility function or local cost, which can be
defined as the squared tracking error at each instant. This kind of
problem formulation is the main focus of dynamic programming, a
technique that solves it through a backward search
from the final step [12]. To make the problem tractable to an on-line
learning approach, Heuristic Dynamic Programming (HDP), an
ACD, introduces a critic block responsible for approximating the
cost-to-go $J(t)$ [13]. The controller, usually referred to as the actor
or action block in ACD literature [14], is then adapted in the
direction of the minimization of $J(t)$ by using the critic to analyze
the impact of its actions over the cost-to-go. DHP is a more
advanced ACD capable of providing superior performance due to
smoother and more accurate information on how $J(t)$ is affected by
the actions of the controller, by using the critic to directly estimate
the partial derivative of the cost-to-go with respect to the plant states.

Although ACDs can be implemented with any differentiable
structure [12], neural networks have been widely used [15,16] due to
their generalization and nonlinear mapping capabilities, as well as
having suitable methods for on-line learning [17]. Given the system
of interest, Recurrent Neural Networks (RNN) were chosen due to
their more efficient handling of dynamic nonlinear mapping [18].
Backpropagation Through Time (BPTT) [19] was used to extract the
approximation of the partial derivatives required during training.
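
As a small worked illustration of the cost-to-go in Equation (1), the sketch below evaluates the discounted sum over a finite record of utilities, assuming that the utility is zero beyond the end of the record. It uses the recursion J(t) = U(t) + gamma * J(t+1) evaluated backward from the final step; the function name `cost_to_go` and the sample values are hypothetical.

```python
def cost_to_go(utilities, gamma):
    """Return J(t) for every t over a finite record of utilities U(t).

    The infinite sum is truncated at the end of the record (assumption),
    so J one step past the record is taken as zero.
    """
    J = [0.0] * (len(utilities) + 1)
    for t in range(len(utilities) - 1, -1, -1):
        J[t] = utilities[t] + gamma * J[t + 1]  # backward recursion
    return J[:-1]


# Hypothetical squared tracking errors that shrink over time.
U = [1.0, 0.5, 0.25]
J = cost_to_go(U, gamma=0.5)
print(J)
```

In an adaptive critic design the future utilities are not available on-line, which is why a critic network is trained to approximate this quantity (or, in DHP, its derivatives with respect to the states) instead of computing it directly.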

In the second layer, a threshold is defined for each of the quality
indexes, dividing them into high ($Hq_c$, $Hq_i$) and low ($Lq_c$, $Lq_i$)
values. The threshold for $q_c(t)$ defines what is to be considered
an acceptable performance, while the one for $q_i(t)$ stipulates the
degree of similarity of the input-output behavior that should be used
to consider two scenarios distinct. Once the level of each index is
determined, the decision process illustrated in Figure 2 takes
place. It is important to notice that in this formulation the actions of
the supervisor take place during the transitions
between states. This characteristic, added to the improved
smoothness of the quality indexes bestowed by the regressive mean,
aids in the generation of the hysteresis required by the
switching scheme.

In state 1, the plant is being controlled and performing satisfactorily,
and no action is required. While in this state, an abrupt fault may
cause the performance to be degraded enough for the controller
quality index $q_c(t)$ to surpass its threshold. In this case, $q_i(t)$ will
remain low or grow on the respective events of a known or unknown
fault.



FIGURE 2. Decision graph of the second layer of the supervisor
system. The states, tagged 1 to 4, are defined by the quality measures $q_c(t)$
and $q_i(t)$. The moments when the actions of switching and adding to the
database are performed are shown on the graph.


implemented by loading a complete set of parameters of the three
neural networks (i.e., identification, action and critic networks) to the
DHP algorithm currently being used. The fact that the controller is
switched to one devised for a similar plant and the natural
generalization capabilities of neural networks add to improved
stability when the parameters are loaded as new initial conditions to
the adaptive process. The database also stores copies of all the
partial derivatives required when updating the networks using
backpropagation through time. Uploading those derivatives also
works to increase switching smoothness since more information
about the new dynamics of the plant is supplied.
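
One possible way to organize such a database is sketched below. It is only an illustration under simplifying assumptions: network parameters and stored derivatives are represented as plain lists, and the names `Entry`, `ICDD`, `add`, and `load` are hypothetical rather than taken from the original implementation.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One identifier and controller set stored in the database."""
    identifier: list   # parameters of the identification network
    action: list       # parameters of the action network
    critic: list       # parameters of the critic network
    derivatives: list = field(default_factory=list)  # stored BPTT derivatives


class ICDD:
    """Minimal stand-in for the identifier and controller database."""

    def __init__(self):
        self._entries = {}

    def add(self, name, entry):
        # Store a copy so that later adaptation does not alter the record.
        self._entries[name] = copy.deepcopy(entry)

    def load(self, name):
        # Return a copy to be used as new initial conditions for adaptation.
        return copy.deepcopy(self._entries[name])


db = ICDD()
db.add("nominal", Entry(identifier=[0.1], action=[0.2], critic=[0.3]))

active = db.load("nominal")
active.action[0] = 9.9            # on-line adaptation changes the active copy
print(db.load("nominal").action)  # the stored record is unchanged
```

Keeping stored and active parameter sets separate reflects the tradeoff discussed earlier: the active controller remains free to adapt, while the database preserves the long-term memory.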


V.  Numerical  Example 

In this section, a numerical example is used as an illustration of
the dynamics of the proposed FTC algorithm. Special emphasis is
given to the actions of the supervisor system, which is the intelligent
core of the algorithm. For the sake of simplicity and understanding,
the plant consists of a simple linear ARMA model, subject to faults
resulting in changes in its parameters. This example by no means
reflects any limitations of the architecture, which has been designed
to deal with complex nonlinear plants that may have their very
dynamics altered by the occurrence of a fault. The models, sampled
at [illegible] Hz and used to simulate the plant under nominal operation and
under each of the artificial faults, are given below.


If both indexes exceed the threshold (state 2), the environment
has abruptly changed due to an unknown fault, and the supervisor is
unable to provide any help to the DHP controller. If $q_i(t)$ remains
low (state 4), there is already a set of DHP parameters in the ICDD
previously adapted to deal with a plant with similar dynamics and to
which switching should take place. The decision process then
remains in state 4 ($Hq_c$ and $Lq_i$) until either the system is
recovered or another fault takes place before that. If the composite
fault is also a known fault, switching takes place again, triggered by
the change in the switching candidate appointed by the first decision
layer.

Incipient faults, often connected to component aging, may be
gradually adapted by the reconfigurable controller and eventually
indicate a high $q_i(t)$, even though $q_c(t)$ remains low during the whole
process (transition from state 1 to 3). In this case, there is no purpose
in learning a new environment/controller pair since the parameters
are continuously changing. As a matter of fact, if allowed to learn all
the transient models, the database might rapidly grow to an
intractable size.

When the DHP controller is adapting to a new environment (state
2), $q_c(t)$ is expected to decrease to the point where it crosses its
threshold (transition to state 3), and a new set of parameters is added
to the ICDD, but two other scenarios must also be considered. The
first one deals with the possibility of an abrupt known fault
happening before the first fault is completely dealt with. In this case,
$q_i(t)$ reaches a low value prior to $q_c(t)$, and switching to the
known environment takes place. The second scenario addresses the
situation in which, due to actuator or physical limitations, although
the controller is capable of reconfiguring itself to generate a stable
trajectory, the performance remains below the desired level. In such a
case, the decision logic remains in state 2 as the reconfigurable
controller cannot be improved by supervisor intervention.
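
The state logic described in the preceding paragraphs can be summarized in a short sketch. The state numbering follows the text (state 1: both indexes low; state 2: both high; state 3: low $q_c$ and high $q_i$; state 4: high $q_c$ and low $q_i$). The function names, the string labels for the actions, and the rule that actions depend only on the previous and current state are simplifying assumptions; repeated switching inside state 4 caused by a change of switching candidate is not modeled.

```python
def supervisor_state(q_c, q_i, thr_c, thr_i):
    """Map the two quality indexes to the decision states 1 to 4."""
    high_c = q_c > thr_c
    high_i = q_i > thr_i
    if not high_c and not high_i:
        return 1  # known scenario, satisfactory performance
    if high_c and high_i:
        return 2  # unknown fault, controller still adapting
    if not high_c and high_i:
        return 3  # good performance under dynamics not in the database
    return 4      # poor performance, but a similar model is stored


def supervisor_action(prev_state, state):
    """Return the action taken on a transition between states."""
    if state == 4 and prev_state != 4:
        return "switch"  # load the stored set appointed by the first layer
    if prev_state == 2 and state == 3:
        return "add"     # adaptation succeeded: memorize the new pair
    return "none"        # includes 1 -> 3 (incipient fault, nothing stored)


# Hypothetical thresholds and index values.
s_prev = supervisor_state(q_c=2.0, q_i=2.0, thr_c=1.0, thr_i=1.0)
s_now = supervisor_state(q_c=0.2, q_i=2.0, thr_c=1.0, thr_i=1.0)
print(s_prev, s_now, supervisor_action(s_prev, s_now))
```

Because actions are tied to transitions rather than to the states themselves, a new entry is added only once per successful adaptation, which helps keep the database compact.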

The  third  layer  manipulates  the  ICDD  by  making  new  entries  and 
by  switching  to  the  reconfigurable  controller  indicated  by  the  first 
layer,  when  requests  arrive  from  the  second.  Switching  is 
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The  incipient  fault  occurs  over  the  nominal  model,  changing  its 
dynamics  gradually  until  the  one  given  above.  The  simulation  was 
carried out with the plant being abruptly changed to a different model
every 10 minutes. The goal is to follow a trajectory composed of a
sine wave that changes its amplitude randomly every half
period. Since in this case no beforehand information about the
system is given to the plant, initially the nominal plant is treated as
an unknown fault and therefore, as displayed in Figure 3, the system
begins in the state where both quality indexes are high.

After the initial transient response, as soon as $q_c(t)$ indicates a
low value, the supervisor flags a control success and adds the
nominal model to the ICDD. The copy of the identification network
that is now part of the ICDD generates a low identification error,
causing $q_i(t)$ to drop sharply.

The identification quality index $q_i(t)$ remains low after a model
is added, even though the training has been stopped and new inputs
are being supplied, indicating two main achievements of the
proposed FTC scheme. The first one is that the neural network used
as an identifier in the DHP architecture was capable of converging
to represent the true dynamics of the system. The second is that the
supervisor was able to recognize the proper moment when a new
identifier and controller pair should be memorized.

After  the  first  10  minutes  of  simulation  the  first  fault  occurs 
abruptly,  changing  the  dynamics  of  the  plant.  While  the 
reconfigurable  controller  adapts  itself  to  the  new  scenario  both 
indexes  grow  indicating  that  the  system  is  going  through  an  Abrupt 
unknown  fault.  As  qc(t)  drops  to  an  acceptable  level,  the  first 

failure  mode  is  recorded  along  with  the  controller  that  was 
specifically  designed  on-line  to  deal  with  if. 

Figure 3. The top graph shows the desired trajectory (dashed), the output of the plant (solid) and the output of the identification network while it adapts
(dotted). The second graph displays the input to the plant as calculated by the adaptive critic controller. The third and fourth graphs show the quality indexes
qc(t) and qi(t) respectively, along with the thresholds used.


After 20 minutes of simulation, the plant returns to the
nominal mode. Because of the change in the dynamics, qc(t)
increases, reflecting the drop in performance. On the other hand,
qi(t) shows only a thin spike, indicating that there already exists
an element in the ICDD that was previously designed to deal with
a system similar (in this case identical) to the present one.
Therefore switching takes place, leading to a much faster
response.

The  second  fault  is  introduced  at  30  minutes.  By  comparing 
with  the  identifier  adapted  for  the  nominal  plant,  the  supervisor 
concludes  that  the  dynamics  are  not  different  enough  to  justify  a 
new  entry  in  the  database.  This  property  is  of  extreme  importance 
in  order  to  achieve  a  database  capable  of  covering  all  the  known 
space, while maintaining a compact set of recorded models. The
third fault, on the other hand, requires a major reconfiguration in
the  controller,  and  so  it  is  also  added  to  the  ICDD  after 
convergence. 

After  50  minutes  of  simulation,  the  plant  is  instantly  reverted 
to  the  nominal  model,  and  the  incipient  fault  is  applied  over  it. 
Since  in  the  initial  moments  the  dynamics  are  still  similar  enough 
to  the  ones  of  the  nominal  model,  switching  takes  place  moments 
after  50  minutes.  As  the  parameters  of  the  plant  are  changing,  the 
controller  is  capable  of  constantly  reconfiguring  itself,  and  the 
tracking error remains low. As the dynamics of the plant drift further
from the nominal ones, qi(t) increases to the point
where the supervisor correctly diagnoses the occurrence of an
incipient fault. Around 57 minutes, qi(t) once again falls as the
input-output relation of the plant becomes increasingly similar to
the one stored when the first fault was learned.

To  illustrate  the  effectiveness  of  the  algorithm  when  a  fault 
presents  itself  for  the  second  time,  fault  3  is  introduced  again  at 
60 minutes. As soon as the environment is recognized as a known
one by low values of qi(t), switching takes place, generating a
smoother and faster response.
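
The supervisor behavior described in this simulation can be read as a two-by-two decision table over the two quality indexes. The sketch below is only an illustration under assumptions: the function name `diagnose`, the enum labels and the use of one fixed threshold per index are hypothetical, since the decision rule is not spelled out in this section.

```python
from enum import Enum


class Diagnosis(Enum):
    KNOWN_DYNAMICS = 1        # both indexes low: control success
    KNOWN_ABRUPT_FAULT = 2    # qc high, qi low: switch to a stored pair
    UNKNOWN_ABRUPT_FAULT = 3  # both high: keep adapting, store on success
    INCIPIENT_FAULT = 4       # qc low, qi high: plant is drifting


def diagnose(qc, qi, qc_threshold, qi_threshold):
    """Classify the current situation from the control quality index qc
    and the identification quality index qi (hypothetical rule)."""
    control_poor = qc > qc_threshold
    model_poor = qi > qi_threshold
    if control_poor and model_poor:
        return Diagnosis.UNKNOWN_ABRUPT_FAULT
    if control_poor:
        return Diagnosis.KNOWN_ABRUPT_FAULT
    if model_poor:
        return Diagnosis.INCIPIENT_FAULT
    return Diagnosis.KNOWN_DYNAMICS
```

For example, the situation at 20 minutes (qc rises while qi stays low after a thin spike) would map to `KNOWN_ABRUPT_FAULT`, which is the case in which switching takes place.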

VI.  Conclusion 

A  multiple  model  approach  to  FTC  based  on  an  intelligent 
dynamic  database  was  presented.  The  application  of  DHP  as  a 
reconfigurable  controller  was  shown  to  give  the  hierarchical 
algorithm  the  amount  of  flexibility  required  to  deal  with  both 
abrupt  and  incipient  changes  in  the  plant.  The  supervisor  system 
was  used  to  accelerate  convergence  of  the  method  by  loading 
new  initial  conditions  to  the  DHP  when  the  plant  is  affected  by  a 
known  abrupt  fault.  A  methodology  was  presented  through  which 
new  fault  scenarios  are  recognized  and  assimilated  on-line  by  the 
database  along  with  parameters  for  the  corresponding  controller. 
Finally, these properties were successfully illustrated in the
in-depth exploration of the numerical simulation example. Although
the results so far have been greatly encouraging, formal
investigation  of  online  stability  and  real-world  complications  is 
required  and  will  be  carried  out  in  future  research. 

Abstract— Since the 1980’s, the application of Evolutionary Algorithms (EA’s) in solving
Multiobjective Optimization Problems (MOPs) has been receiving growing interest from the evolutionary
computation community. To search for a family of “acceptable” solutions, a so-called Pareto set, by using
EA’s population-based parallel searching ability, several MultiObjective Evolutionary Algorithms
(MOEAs) have been proposed. However, most of these MOEAs have difficulty in dealing with the trade-off
between uniformly distributing the computational resources and finding the near-complete and
near-optimal Pareto set. On the other hand, according to the No Free Lunch theorems, no formal assurance of an
algorithm’s general effectiveness exists if insufficient knowledge of the problem characteristics is
incorporated into the algorithm domain. In this paper, the authors propose a new evolutionary approach to
multiobjective optimization problems, the Rank-Density based Genetic Algorithm (RDGA), that
synergistically integrates selected features from existing MOEAs in a unique way. A new ranking method,
automatic  accumulated  ranking  strategy,  and  a  “forbidden  region”  concept  are  introduced,  completed  by  a 
revised  adaptive  cell  density  evaluation  scheme  and  a  rank-density  based  fitness  assignment  technique.  In 
addition,  four  types  of  MOP  features,  such  as  discontinuous  and  concave  Pareto  front,  local  optimality, 
high-dimensional  decision  space  and  high-dimensional  objective  space  are  exploited  and  the  corresponding 
MOP  test  functions  are  designed.  By  examining  the  selected  performance  indicators,  RDGA  is  found  to  be 
statistically  competitive  with  four  state-of-the-art  MOEAs  in  terms  of  keeping  the  diversity  of  the 
individuals  along  the  trade-off  surface,  tending  to  extend  the  Pareto  front  to  new  areas  and  finding  a  well- 
approximated  Pareto  optimal  front. 


1.  Introduction 


In  many  scientific  and  engineering  disciplines,  it  is  not  uncommon  to  face  a  design  challenge 
when  there  are  several  criteria  or  design  objectives  to  be  met  simultaneously.  If  these  objectives  are 
conflicting,  then  the  problem  becomes  one  of  finding  the  best  possible  designs  that  satisfy  the  competing 
objectives  under  different  trade-off  scenarios.  With  these  multiple  objectives  and  constraints  taken  into 
consideration, an optimum design problem can then be formulated. This type of problem is known as a
multiobjective, multicriteria or vector optimization problem [1].

Multiobjective Optimization (MO) is a very demanding research topic because most real-world
problems not only have a multiobjective nature, but also raise many open issues to be answered qualitatively and
quantitatively. In fact, there is not even a universally accepted definition of “optimum” as in single-objective
optimization [2], because the solution to an MOP is generally more than a single point. It consists
of a family of points in a feasible solution space, which describes the trade-offs among
conflicting objectives.


A formal notion of Pareto optimality is given by Fonseca and Fleming in [3]. Consider, without
loss of generality, the minimization of the n components $f_k$, $k = 1, \ldots, n$, of a vector function $f$ of a
decision vector $x$ in a universe $U$, where $f(x) = (f_1(x), \ldots, f_n(x))$. Then a decision vector $x_u \in U$ is said
to be Pareto-optimal if and only if there is no $x_v \in U$ for which $v = f(x_v) = (v_1, \ldots, v_n)$ dominates
$u = f(x_u) = (u_1, \ldots, u_n)$, that is, there is no $x_v$ such that

$$\forall i \in \{1, \ldots, n\},\ v_i \le u_i, \quad \text{and} \quad \exists i \in \{1, \ldots, n\},\ v_i < u_i.$$

The  set  of  all  Pareto-optimal  decision  vectors  is  referred  to  as  the  Pareto-optimal  set  of  the 
problem.  The  corresponding  set  of  objective  vectors  is  called  the  non-dominated  set,  or  Pareto  front. 
Apparently,  the  Pareto  front  dominates  all  other  possible  solutions,  and  in  many  cases,  it  is  located  on  the 

boundary  of  the  objective  space  (i.e.,  feasible  solution  space)  as  shown  in  Figure  1  for  a  two-objective 
optimization  problem. 
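
As a minimal sketch of the dominance relation defined above (minimization is assumed; the helper names `dominates` and `nondominated` and the sample coordinates are illustrative and not taken from the paper):

```python
def dominates(v, u):
    """True if objective vector v dominates u under minimization:
    v is no worse in every component and strictly better in at least one."""
    no_worse = all(a <= b for a, b in zip(v, u))
    strictly_better = any(a < b for a, b in zip(v, u))
    return no_worse and strictly_better


def nondominated(points):
    """Return the objective vectors that no other vector in `points` dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical coordinates: A and B trade off against each other, C is worse than both.
A, B, C = (1.0, 4.0), (3.0, 2.0), (4.0, 5.0)
front = nondominated([A, B, C])
```

With these sample points, A and B are mutually nondominated and form the approximate front, while C is dominated by both.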

[Figure 1 placeholder: points A and B are nondominated, point C is dominated, and the Pareto front is marked.]

Figure 1 Graphical illustration of the Pareto optimality of a two-objective minimization problem

In their early development, Evolutionary Algorithms (EA’s), a class of population-based
optimization  approaches,  have  been  recognized  to  be  well  suited  for  multiobjective  optimization.  In  EA’s, 
multiple  individuals  search  for  multiple  solutions  in  parallel,  advantageously  producing  a  family  of  feasible 
solutions  to  the  problem.  The  ability  to  handle  complex  problems  involving  features  such  as  discontinuity, 
multimodality  and  disjoint  objective  spaces,  reinforces  the  potential  effectiveness  of  EA’s  in  multiobjective 
search  and  optimization,  which  is  perhaps  the  problem  area  where  EA’s  most  distinguish  themselves  from 
the  other  alternatives  [3]. 

Since the 1980’s, several Multiobjective Evolutionary Algorithms (MOEAs) have been proposed
and applied to Multiobjective Optimization Problems (MOPs) [1]. These algorithms share the same
purpose: searching for a uniformly distributed, near-optimal and near-complete Pareto front for a given
MOP. However, this ultimate goal is far from being accomplished by the existing MOEAs described in the
literature. On the one hand, most MOPs are very complicated and require the computational resources
to be homogeneously distributed in a high-dimensional search space. On the other hand, the better-fit
individuals generally have strong tendencies to restrict searching efforts to local areas because of the
genetic drift phenomenon [4-5], which results in the loss of diversity due to stochastic sampling. This
conflict is a well-known trade-off pertaining to the efficiency and efficacy dilemma [6].

Additionally, to demonstrate or judge the performance of these MOEAs, most researchers have used
numeric MOPs as benchmark problems. In the MOEA community, a limited set of benchmark test functions is
frequently used in publications, because this makes the performance comparison of different algorithms
relatively easy and straightforward. However, based on the No Free Lunch (NFL) theorems [7], no formal
assurance of an algorithm’s general effectiveness exists if insufficient knowledge of the problem domain is
incorporated into the algorithm design. Therefore, visually comparing MOEA performances on non-standard
and unjustified numeric MOPs does little to determine a given MOEA’s actual efficiency and
effectiveness. A standard set of carefully designed benchmark test functions exhibiting relevant MOP
domain characteristics can provide the necessary fair comparative basis [8].

In this paper, the authors conduct an extensive study on four selected MOP features that may
make it difficult for MOEAs to find true and uniformly distributed Pareto fronts. Function F1 is
derived from an existing MOP to create a discontinuous and concave Pareto front. Functions F2-1 and
F2-2 are designed to explore local and global Pareto optimality caused by objective functions and
constraints, respectively. Functions F3 and F4 are test functions that show the complications for MOEAs in
coping with MOPs involving high-dimensional decision and objective spaces, respectively. Four
representative MOEAs (Fonseca’s MOGA [9], PAES [10], SPEA II [11] and NSGA-II [12]) were applied
to generate the simulation results for each test function. In addition, a new multiobjective optimization
approach proposed in this paper, the Rank-Density based Genetic Algorithm (RDGA), is also examined on
these test functions for a comparative study. We show that RDGA synergistically integrates selected
features of existing MOEAs in a unique way and has advantages over the other algorithms under
consideration in finding a near-optimal, near-complete and uniformly distributed Pareto front. The
simulation results show that the proposed RDGA is competitive with the selected MOEAs as measured by several
performance metrics. In this paper, we make no distinction between MOEAs and MOGAs, because their
names are mostly chosen for historical reasons. Different approaches have been “recombined”, which
makes a classification no longer feasible. For instance, the Pareto Archived Evolutionary Strategy (PAES)
[10] uses binary representation from GA, while Non-dominated Sorting Genetic Algorithm II (NSGA-II)
[12] uses a “plus” selection (elitist) from evolutionary strategies.

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  reviews  existing  literature  on  the 
most  well  regarded  MOEAs.  Section  3  proposes  a  new  Rank-Density  based  Genetic  Algorithm  that  is 
designed  to  deal  with  high-dimensional  objective  functions,  explore  the  optimality  of  the  candidate  Pareto 
points  and  maintain  the  diversity  of  the  final  Pareto  front.  In  Section  4,  we  discuss  some  typical  features  a 
Pareto  front  may  possess  and  how  to  measure  the  performance  quantitatively.  Section  5  presents  four  MOP 
benchmark  functions  that  create  discontinuous  and  concave  Pareto  front,  local  optimality,  high-dimensional 
decision space and high-dimensional objective space. RDGA and four representative MOEAs are
used to produce simulation results for the given test functions. The performance indicators of the
resulting  Pareto  fronts  are  examined  to  compare  the  performances  of  RDGA  and  the  chosen  algorithms. 
Finally,  Section  6  provides  some  concluding  remarks  along  with  pertinent  observations. 

2.  Evolutionary  Multiobjective  Optimization  algorithms 

Generally,  the  approximation  of  the  Pareto-optimal  set  involves  two  conflicting  objectives:  the 
distance  to  the  true  Pareto  front  is  to  be  minimized,  while  the  diversity  of  the  generated  solutions  is  to  be 
maximized  [11].  To  address  the  first  objective,  a  Pareto  based  fitness  assignment  method  is  usually 
designed  in  many  existing  MOEAs  [3]  in  order  to  guide  the  search  toward  the  true  Pareto  optimal  front.  For 
the  second  objective,  some  successful  MOEAs  provide  density  estimation  methods  to  preserve  the 
population diversity. In addition, several other techniques have also been adopted, such as elitism schemes
[10]-[12], crowded comparison [9, 12], archive truncation [11], etc. These methods and techniques can
be found in four state-of-the-art MOEAs (MOGA, PAES, NSGA-II and SPEA II), which are briefly
reviewed in the following.

2.1  Multiobjective  Genetic  Algorithm—  MOGA 

In  their  MOGA  [9],  Fonseca  and  Fleming  proposed  a  ranking  scheme  in  which  the  rank  of  a 
certain  individual  corresponds  to  the  number  of  individuals  in  the  current  population  by  which  it  is 
dominated.  Based  on  this  scheme,  all  the  non-dominated  individuals  are  assigned  rank  1,  while  dominated 
ones  are  penalized  according  to  the  population  density  of  the  corresponding  region  of  the  trade-off  surface. 
Moreover,  to  prevent  premature  convergence  of  the  population,  a  niche-formation  method  to  distribute  the 
population  over  the  Pareto  front  in  the  objective  space  is  adopted.  More  discussions  pertaining  to  the 
ranking  method  in  MOGA  are  given  in  Subsection  3.1. 

2.2 Pareto Archive Evolutionary Strategy— PAES

As a local search algorithm that simulates a random mutation hill-climbing strategy, PAES may
represent the simplest possible, yet effective, nontrivial algorithm capable of generating diverse solutions in
the Pareto optimal set [10]. In PAES, a pure mutation operation is adopted to carry out the local search.
A reference archive of previously found non-dominated solutions is updated at each generation in order to
identify the dominance ranking of all the resulting solutions. Although (1+1)-PAES originated as the
simplest version, PAES can also generate λ mutants by mutating one of the μ current solutions, which is
called (μ+λ)-PAES [10]. Since PAES does not perform a population-based search, only tournament
selection can be applied to determine the survivors of the next generation.

2.3  Non-dominated  Sorting  Genetic  Algorithm  II —  NSGA-II 

NSGA-II [12] was developed from its predecessor, NSGA [13]. In NSGA-II, a non-dominated sorting
approach is used to assign each individual a Pareto rank, and a crowding distance assignment method is
applied to implement density estimation. When comparing two individuals, NSGA-II prefers
the point with the lower rank value, or, if both points belong to the same front, the point located in the region
with fewer points. By combining a fast non-dominated sorting approach, an elitism
scheme and a parameter-less sharing method with the original NSGA, NSGA-II is reported to produce a better
spread of solutions on some test problems [12].
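
The comparison rule just described can be sketched as follows. This is a minimal illustration that follows the crowding distance as commonly published for NSGA-II, not code from this report; the function names `crowding_distance` and `crowded_better` are illustrative.

```python
import math


def crowding_distance(front):
    """Crowding distance for a list of objective tuples that share one front.
    Boundary points in each objective receive an infinite distance."""
    n = len(front)
    dist = [0.0] * n
    if n == 0:
        return dist
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue  # all values equal in this objective: nothing to add
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist


def crowded_better(rank_a, dist_a, rank_b, dist_b):
    """True if individual a is preferred: lower rank wins, and within the
    same front the less crowded point (larger distance) wins."""
    return rank_a < rank_b or (rank_a == rank_b and dist_a > dist_b)
```

A larger crowding distance means fewer neighbours nearby, so within one front the point in the sparser region is preferred.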

2.4  Strength  Pareto  Evolutionary  Algorithm  II —  SPEA II 

Similar to NSGA-II, SPEA II [11] is an enhanced version of SPEA [1]. In SPEA II, instead of
calculating a standard Pareto rank, each individual in both the main population and the elitist archive is assigned a
strength value, which incorporates both dominance and density information. On the basis of the strength
values, the final rank value of an individual is determined by the summation of the strengths of the individuals
that dominate it. Meanwhile, a kth nearest neighbor density estimation method is applied to obtain the
density value of each individual. The final fitness value is the sum of the rank and density values. In addition,
a truncation method is used in the elitist archive in order to keep the number of individuals contained in the
archive constant. In the experimental results, SPEA II shows better performance than SPEA [1] over
all the test functions considered therein.
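
The fitness assignment summarized above can be sketched as follows. The specific formulas (strength as the number of individuals a point dominates, density as 1 / (sigma_k + 2) with sigma_k the distance to the kth nearest neighbor) follow the commonly published SPEA II description and are not spelled out in this section; the function name `spea2_fitness` and the default `k=1` are assumptions for illustration.

```python
import math


def dominates(v, u):
    """True if v dominates u under minimization."""
    return all(a <= b for a, b in zip(v, u)) and any(a < b for a, b in zip(v, u))


def spea2_fitness(points, k=1):
    """Return (strength, raw_rank, fitness) lists for the given objective vectors,
    treating main population and archive as one combined list."""
    n = len(points)
    # Strength: how many individuals each point dominates.
    strength = [sum(dominates(p, q) for q in points) for p in points]
    # Raw rank: summed strengths of the dominators of each point.
    raw = [sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
           for i in range(n)]
    fitness = []
    for i in range(n):
        dists = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        sigma_k = dists[min(k, len(dists)) - 1] if dists else 0.0
        fitness.append(raw[i] + 1.0 / (sigma_k + 2.0))  # density term is below 1
    return strength, raw, fitness
```

Because the density term is always smaller than 1, every nondominated individual (raw rank 0) ends up with a fitness below 1, so density only breaks ties among individuals of equal raw rank.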


3.  Rank-Density  Based  Genetic  Algorithm 

From the literature review, the main deficiency in the existing MOEAs lies in designing a suitable
fitness assignment strategy in order to search for a near-complete and near-optimal approximated Pareto
front for the given optimization problem. Unfortunately, these two objectives are contradictory. On the one
hand, the “genetic drift” character of EA needs to be exploited to converge the solution to a near optimal
point. On the other hand, the “genetic drift” phenomenon must be avoided in order to sketch a uniformly
sampled trade-off surface for the final Pareto front. Based on these considerations, a new Rank-Density
based Genetic Algorithm (RDGA), which converts a high-dimensional MOP into a bi-objective
optimization problem to minimize fitness rank values and cell densities, is proposed. Six crucial procedures
of RDGA are discussed as follows.

3.1  Automatic  Accumulated  Ranking  Strategy  (AARS) 

First introduced by Goldberg [14], and applied by NSGA-II [12], the pure Pareto ranking method
assigns rank 1 to all the non-dominated individuals in the population and temporarily removes them from
contention; a new set of non-dominated individuals is then assigned rank 2, and so forth.
Fonseca [9] further improved the ranking method by including density information in the rank
value: an individual’s rank corresponds to the number of individuals in the current population that dominate
it. For example, consider an individual $y$ at generation $t$, which is dominated by $p^{(t)}$ individuals in the
current generation. Its rank value is given by

$$\mathrm{rank}(y, t) = 1 + p^{(t)}. \quad (2)$$

All the non-dominated individuals are assigned rank value 1, while dominated ones are penalized
according to the population density of the corresponding region of the trade-off surface. In addition, as
discussed in Subsection 2.4, SPEA II [11] uses the summation of the strength values as the
rank value to fulfill the same task. In this paper, we propose an Automatic Accumulated Ranking Strategy
(AARS). In AARS, an individual's rank value is defined as the summation of the rank values of the
individuals that dominate it. For the same example, assume that at generation $t$ individual $y$ is dominated by
$p^{(t)}$ individuals $y_1, y_2, \ldots, y_{p^{(t)}}$, whose rank values are already known as $\mathrm{rank}(y_1, t)$, $\mathrm{rank}(y_2, t)$, $\ldots$,
$\mathrm{rank}(y_{p^{(t)}}, t)$. Its rank value can be computed by

$$\mathrm{rank}(y, t) = 1 + \sum_{j=1}^{p^{(t)}} \mathrm{rank}(y_j, t). \quad (3)$$

By AARS, all the non-dominated individuals are still assigned rank value 1, while dominated ones are
penalized to reduce the population density and redundancy. For instance, suppose we want to minimize two
objectives, $f_1$ and $f_2$, and an MOEA generates eleven individuals. Their rank values based on the four ranking
techniques proposed in NSGA-II, MOGA, SPEA II and AARS are illustrated in Figure 2, where each dot
represents a candidate phenotype solution. Considering all the individuals located in the lower-right area,
AARS provides exactly the same rank values as those computed by the pure Pareto ranking method (adopted
by NSGA-II [12]), since all the individuals are clearly aligned and not crowded at all. Therefore, adding
extra density information (as in SPEA II) may not be necessary in this case. Meanwhile, AARS
does impose a penalty on the dominated individuals located in the upper-left area. The reason for penalizing
all the dominated individuals in this area is that there exist several non-dominated individuals that can
largely represent the dominated ones. Therefore, without increasing the population size, the population
diversity is maintained by penalizing those dominated individuals through AARS.
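
Equations (2) and (3) can be compared directly in code. The following is a minimal sketch, assuming minimization; the function names `moga_rank` and `aars_rank` and the sample points are illustrative and not taken from the paper.

```python
def dominates(v, u):
    """True if v dominates u under minimization."""
    return all(a <= b for a, b in zip(v, u)) and any(a < b for a, b in zip(v, u))


def moga_rank(points):
    """Eq. (2): one plus the number of individuals that dominate each point."""
    return [1 + sum(dominates(q, p) for q in points) for p in points]


def aars_rank(points):
    """Eq. (3): one plus the sum of the ranks of the dominating individuals."""
    n = len(points)
    dominators = [[j for j in range(n) if dominates(points[j], points[i])]
                  for i in range(n)]
    rank = [0] * n
    # Dominance is transitive, so every dominator of i has strictly fewer
    # dominators than i. Processing in this order guarantees that the ranks
    # of all dominators are known before they are summed.
    for i in sorted(range(n), key=lambda i: len(dominators[i])):
        rank[i] = 1 + sum(rank[j] for j in dominators[i])
    return rank


# Hypothetical population: two nondominated points and two dominated ones.
population = [(1, 4), (4, 1), (2, 5), (5, 5)]
print(moga_rank(population))
print(aars_rank(population))
```

In this sample the last point is dominated by all three others, one of which is itself dominated, so AARS penalizes it more heavily than the plain dominator count of Eq. (2).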


Figure 2 Individual rank values resulting from different ranking methods

3.2 Adaptive density value calculation

As noted in [11], ranking schemes such as AARS and those of [9,11] provide a sort of niching
mechanism based on the concept of Pareto dominance, but they may fail when most individuals do not
dominate each other. Therefore, additional density information is incorporated to discriminate between
individuals having identical raw fitness values. In RDGA, to deal with this problem, we adopt a modified
adaptive cell density evaluation scheme derived from [10], as shown in Figure 3. The cell width in each
objective dimension is given by

$$d_i = \frac{\max_{x \in X} f_i(x) - \min_{x \in X} f_i(x)}{K_i}, \quad i = 1, \ldots, n, \quad (4)$$

where $d_i$ is the width of the cell in the ith dimension, $K_i$ denotes the number of cells designated for the ith
dimension (e.g., in Figure 3, $K_1 = 6$ and $K_2 = 4$), and $x$ is taken from the whole decision space $X$. As
the  maximum  and  minimum  fitness  values  in  the  objective  space  will  change  with  different  generations,  the 
cell  size  will  vary  from  generation  to  generation  to  maintain  the  accuracy  of  the  density  calculation.  The 
density  value  of  an  individual  is  defined  as  the  number  of  the  individuals  located  in  the  same  cell.  Note  that 
in PAES [10], the grid location of a solution in the objective space is obtained by repeatedly bisecting the
range in each objective and finding in which half the solution lies. However, RDGA uses a different scheme
to locate the cell an individual belongs to. First, the cells are created by dividing the range of the current
objective space based on $K_i$ and the given initial population. Second, the center position of each cell is
obtained and stored. Third, each individual of the initial population searches for its nearest cell center,
identifies this cell as its “home address” and considers the other individuals who share the same “home
address” as its “family members”. Then, for each of these “homes”, the number of “family members” who
dwell in it is counted and saved as its density value. Fourth, when an offspring is generated and accepted,
its “home address” can be easily located by following the third step, and the density value of its home is
increased by one. Meanwhile, if an existing individual is eliminated, its “home” is notified and the density
value of its “home” is decreased by one. Therefore, at each generation, an individual can access its “home
address” and then obtain the corresponding density value. The “home address” is merely a “pointer” that
tells an individual where to find its density value. For instance, as shown in Figure 3, the “home address”
and density value of individual A are (4,3) and 4, respectively. Therefore, if a newly generated or removed
individual does not change the boundary of the range of the current objective space, only the density value
of its “home” is changed. The density values of the other “homes” (cells) are not affected. This setting
avoids the unnecessary recalculation of an unchanged range of the objective space and density values.
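
The four steps above can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the objective range is taken over the current population, the nearest cell center is found by floor division (which is equivalent on a uniform grid except for ties on cell borders), addresses are 1-based as in Figure 3, and the names `cell_widths` and `home_address` are hypothetical.

```python
from collections import Counter


def cell_widths(points, cells):
    """Eq. (4): per-objective cell width from the current objective range."""
    lows = [min(p[i] for p in points) for i in range(len(cells))]
    highs = [max(p[i] for p in points) for i in range(len(cells))]
    widths = [(hi - lo) / k for lo, hi, k in zip(lows, highs, cells)]
    return lows, widths


def home_address(point, lows, widths, cells):
    """1-based grid cell ("home address") whose center is nearest to point."""
    address = []
    for f, lo, d, k in zip(point, lows, widths, cells):
        index = int((f - lo) / d) if d > 0 else 0
        index = max(0, min(index, k - 1))  # clamp points on or beyond the border
        address.append(index + 1)
    return tuple(address)


# Hypothetical population in a two-objective space with K1 = 6 and K2 = 4.
points = [(0.0, 0.0), (6.0, 4.0), (0.5, 0.5), (0.4, 0.2), (3.5, 2.5)]
cells = (6, 4)
lows, widths = cell_widths(points, cells)
homes = [home_address(p, lows, widths, cells) for p in points]
density = Counter(homes)  # density value of each occupied "home"

# Accepting an offspring or removing an individual touches only one cell,
# provided the objective range is unchanged.
child = (3.6, 2.4)
density[home_address(child, lows, widths, cells)] += 1
density[homes[3]] -= 1
```

Storing only the address with each individual and the counts in one shared map mirrors the "pointer" idea in the text: an update is a single increment or decrement.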

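[Figure 3 placeholder: density grid in the objective space, with six cells (1 to 6) along the $f_1$ axis and four cells (1 to 4) along the $f_2$ axis.]

Figure 3 Density map and density grid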

3.3  Rank  and  density  based  fitness  assignment 

Because  rank  and  density  values  represent  fitness  and  population  diversity,  respectively,  we 
assigned  them  as  two  important  attributes  to  each  individual.  Therefore,  any  multiobjective  optimization 
problem  can  be  converted  into  a  bi-objective  optimization  problem.  On  the  other  hand,  since  we  need  to 
minimize  rank  value  together  with  density  value,  some  further  modifications  need  to  be  made  to  the 
original  notation. 

First, instead of minimizing the density value of an individual, we minimize the density value of
the entire population. Based upon the definition of the cell density, an individual located in a crowded cell
must have a relatively higher density value, which contributes much more to the population density value
than an individual in a sparse area does. For example, a cell containing ten individuals, each with density
value 10, will contribute 10 x 10 = 100 to the population density value, whereas a cell containing only one
individual will contribute 1 to the population density value. The average individual density value can be
obtained by dividing the current population density value by the current population size.
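
The arithmetic in this paragraph amounts to summing the squared occupancy of every cell. A minimal sketch (the function name `population_density` and the cell labels are illustrative):

```python
from collections import Counter


def population_density(homes):
    """Return (population density value, average individual density value).

    `homes` lists the cell address of every individual. Each individual
    contributes the occupancy of its own cell, so a cell holding c
    individuals contributes c * c in total."""
    counts = Counter(homes)
    total = sum(c * c for c in counts.values())
    return total, total / len(homes)


# Ten individuals crowded into one cell and a single individual in another.
total, average = population_density(["crowded"] * 10 + ["sparse"])
```

Here the crowded cell contributes 100 and the sparse cell contributes 1, so moving individuals out of crowded cells reduces the total quickly.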

Second,  after  the  rank  and  density  values  of  each  individual  have  been  extracted,  a  modified 
Vector  Evaluated  Genetic  Algorithm  (VEGA)  [15]  is  applied  to  fulfill  the  fitness  assignment.  It  is  well 
known  that  VEGA  possesses  two  deficiencies:  1)  it  does  not  have  a  scheme  to  maintain  the  diversity  of  the 
evolved  Pareto  front,  and  2)  it  has  difficulty  in  dealing  with  the  problems  involving  concave  trade-off 
surfaces.  However,  as  mentioned  above,  the  goal  of  RDGA  is  to  find  the  non-dominated  individuals  with 
the  rank  value  equal  to  1  and  reduce  the  population  density  value  to  obtain  a  uniformly  distributed  trade-off 
surface. In this setting, there is no concern about keeping the population diversity in the rank-density
domain. Furthermore, whether the “Pareto front” in the rank-density domain is concave is not an issue,
since it is not a real Pareto front for the MOP under consideration. Therefore, a simple VEGA is effective
enough to fulfill the fitness assignment after the original optimization problem has been transformed into
the rank-density domain. It is worth noting that the idea of converting a multiobjective problem into a domination
measure function and a neighboring density function originated in [16] and was later revised in [17].
However, in those papers, the two newly formulated objective functions are derived from Goldberg’s ranking
scheme [14] and Horn’s niche sharing method [18]. The two objective functions are then combined into
one nonlinear fitness function, which is considered the final fitness function. Because rank and density
values have distinct characteristics, it is very difficult for that algorithm to designate a suitable ad hoc
coefficient to bias the preference during the evolutionary process.

Figure 4 Illustration of the valid range and the forbidden region


3.4  Crossover  and  mutation 

For  crossover,  the  parent  selection  and  replacement  schemes  are  borrowed  from  Cellular  GA  [19] 
to  explore  the  new  search  area  by  “diffusion.”  For  each  subpopulation,  a  fixed  number  of  parents  are 
randomly  selected  for  crossover.  Then,  each  selected  parent  performs  crossover  with  the  best  individual 
(the  one  with  the  lowest  rank  value)  within  the  same  cell  and  the  nearest  neighboring  cells  that  contain 
individuals.  If  one  offspring  produces  better  fitness  (a  lower  rank  value  or  a  lower  population  density  value) 
than  its  corresponding  parent,  it  replaces  its  parent.  The  replacement  scheme  of  the  mutation  operation  is 
analogous. Meanwhile, because RDGA takes the minimization of the population density value as one of its
objectives, the entire population can be expected to drift away from the Pareto front, in the direction
where the population density value is lower. Although moving away from the true Pareto front can reduce
the population density value, such individuals clearly hinder the convergence of the population to the
Pareto front. To prevent “harmful” offspring from surviving and affecting the evolutionary
direction and speed, a forbidden region concept is proposed in the replacement scheme for the density
subpopulation, thereby preventing the “backward” effect. The forbidden region includes all the cells
dominated by the selected parent. An offspring located in the forbidden region will not survive into the next
generation, and thus the selected parent will not be replaced. As shown in Figure 4, suppose our goal is to
minimize objectives $f_1$ and $f_2$, and a resulting offspring of the selected parent $p$ is located in the forbidden
region. Under RDGA, this offspring will be eliminated even if it reduces the population density, because this
kind of offspring tends to push the entire population away from the desired evolutionary
direction.
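
The forbidden-region test described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the grid bounds `lower` and `upper`, the helper names, and the decision to compare integer cell indices by Pareto dominance are all hypothetical choices made for the example, and the default of 20 grids per objective is borrowed from the value quoted for the density indicator.

```python
def cell_index(point, lower, upper, grids=20):
    """Map an objective vector to integer grid-cell coordinates (assumed uniform grid)."""
    idx = []
    for value, lo, hi in zip(point, lower, upper):
        width = (hi - lo) / grids
        k = int((value - lo) / width) if width > 0 else 0
        idx.append(min(max(k, 0), grids - 1))  # clamp to the valid cell range
    return tuple(idx)


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def in_forbidden_region(parent, offspring, lower, upper, grids=20):
    """An offspring is forbidden when its cell is dominated by the parent's cell."""
    parent_cell = cell_index(parent, lower, upper, grids)
    child_cell = cell_index(offspring, lower, upper, grids)
    return dominates(parent_cell, child_cell)


def replace_in_density_subpopulation(parent, offspring, lower, upper, grids=20):
    """Keep the parent whenever the offspring falls in the forbidden region."""
    if in_forbidden_region(parent, offspring, lower, upper, grids):
        return parent
    return offspring  # the density comparison itself is omitted in this sketch
```

An offspring that lands in the same cell as its parent is not treated as forbidden here, because equal cell indices do not dominate each other; whether the original method handles that case the same way is not stated in the text.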

As discussed in Subsection 3.1, the Automatic Accumulative Ranking Strategy (AARS) includes a
scheme that penalizes individuals located in crowded areas, which adds a bias that keeps the
population density value from growing too much while RDGA is minimizing the
rank value. Meanwhile, the forbidden region introduces another bias that prevents offspring
from having higher ranks than their parents while RDGA is evolving a lower population density value.
Therefore, RDGA can be interpreted as converting an MOP into two new single-objective
optimization problems, minimizing the rank and density values, and then performing an evolutionary process
to optimize each of the objectives in turn. It is necessary to note that these two biases make the two objectives
of RDGA highly correlated. When one objective is being optimized, the corresponding bias takes the
other objective as a constraint to keep the computational resources evenly distributed between the two
objectives.

3.5  Constraint  handling 

To handle the constraints, every newly generated offspring is tested against all the constraint
functions to determine whether it is a valid solution. If the offspring satisfies all the constraints, it is
evaluated by the fitness function to obtain its fitness value; otherwise, it is discarded.
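
A minimal sketch of this discard-if-infeasible rule is given below. The convention that each constraint is a function `g` with `g(x) <= 0` meaning "satisfied", and the function names themselves, are assumptions made for illustration; the text does not prescribe a particular encoding.

```python
def is_feasible(x, constraints):
    """True when every constraint function g satisfies g(x) <= 0 (assumed convention)."""
    return all(g(x) <= 0.0 for g in constraints)


def evaluate_offspring(x, objectives, constraints):
    """Return the objective vector of a feasible offspring, or None if it is discarded."""
    if not is_feasible(x, constraints):
        return None  # infeasible offspring never reach the fitness evaluation
    return tuple(f(x) for f in objectives)


# Hypothetical two-variable example: objectives are the variables themselves,
# and the single constraint requires the point to lie outside the unit circle.
objectives = [lambda x: x[0], lambda x: x[1]]
constraints = [lambda x: 1.0 - (x[0] ** 2 + x[1] ** 2)]
```

With these definitions, `evaluate_offspring((1.0, 1.0), objectives, constraints)` yields an objective vector, while a point inside the unit circle is rejected before any fitness is computed.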


3.6  Archiving  the  candidate  Pareto  points 

The elitism scheme in [20] is also adopted in RDGA. At each generation, the non-dominated
individuals created from the main population are copied and stored in an archive. Meanwhile, a non-dominated
solution in the archive may also be selected with a certain probability as a parent to perform
genetic operations. This probability $p_e^t$ is called the “elitism intensity” and, according to [1], at each generation
$t$, the probability of sampling an individual from the archive is given by

$$p_e^t = [\ldots] \left(\frac{|B|}{|A| + |B|}\right)^{2} \qquad (5)$$

where $A$ and $B$ represent the archive of elitists and the main population, respectively. After the evolution
process has terminated, the resulting solutions in both the main population and the archive are compared to
derive the final Pareto front.
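
The archiving step can be illustrated with the short sketch below. It assumes minimization and represents each individual only by its objective vector; the function names and the simple truncation to `max_size` are hypothetical, since the text does not say how an over-full archive is pruned.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_archive(archive, candidates, max_size=100):
    """Merge candidates into the archive, keeping only mutually non-dominated points."""
    merged = list(archive)
    for c in candidates:
        # Skip candidates that are dominated by, or identical to, an archived point.
        if any(dominates(a, c) or a == c for a in merged):
            continue
        # Remove archived points that the new candidate dominates.
        merged = [a for a in merged if not dominates(c, a)]
        merged.append(c)
    return merged[:max_size]  # naive truncation, an assumption of this sketch
```

Applying the same function to the union of the final main population and the archive gives the set of non-dominated points that would be reported as the final front.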
4. MOP Test Function — Features and Indicators


According to [21], the design of a variety of MOP benchmark problems and performance metrics is
essential for comparing the performance of different MOEAs. Because a multiobjective
optimization problem can be closely related to a combination of Single-objective Optimization Problems
(SOPs), a review of the literature on the features of SOP test functions is helpful. In his SOP test
bed study [22], De Jong stated that six problem characteristics need to be examined: continuous and
discontinuous, convex and nonconvex, unimodal and multimodal, quadratic and nonquadratic, low and
high dimensionality, and deterministic and stochastic. In addition, Michalewicz [23] addressed other issues
that need to be considered in SOP test bed design, such as the number of constraints, the type of
constraints and the ratio between the feasible and the complete search space. Some of these properties
are also valuable for an MOP and must be incorporated into the test bed design. Nevertheless, because the
purpose of solving an MOP is to search for a near-complete set of non-dominated solutions (the Pareto front),
the features that make the true Pareto front difficult to find are the primary concerns in MOP test
function design. To date, several well-crafted benchmark test functions have been
designed and applied by different researchers in the MOEA community [24-27]. In this paper, we focus our
investigation on four distinct types of MOP test functions: MOPs with a discontinuous and concave Pareto front,
global/local optimality, a high-dimensional decision space and a high-dimensional objective space.

We deploy five MOEAs (MOGA, PAES, NSGA-II, SPEA II and the proposed RDGA) in the
simulation and run each of the algorithms 50 times to obtain the statistical results. For each run, a new
initial population of 100 individuals is randomly generated and used by each of the four population-based
MOEAs (i.e., MOGA, NSGA-II, SPEA II and RDGA), while only one initial individual is generated for
PAES according to its design procedure [10]. The archive size is set to 100 for all the selected MOEAs
that involve an elitism scheme. We use three indicators derived from the final generation of the 50 runs to
benchmark the comparison results via statistical Box plots: the average individual rank value, the average
individual density value and the average individual distance. As discussed in Subsection 3.1,
different ranking schemes produce different rank values for the same individual, and these values are used in the
respective fitness evaluations and selections. However, for a fair comparison of the ranking indicator across
MOEAs, we use Goldberg’s pure Pareto ranking method [14] to recalculate the rank value of each
individual produced by each applied MOEA. Meanwhile, as shown in Figure 3, the average individual
density value is calculated as the mean of the individual density values over the entire population.
Here, according to the population size, we choose the number of grids for each objective dimension to be
20. Furthermore, because the rank is a relative value, we cannot guarantee that the final
population is a true Pareto set even if all of its individuals have rank values equal to 1, as shown in
Figure 5. For this reason, we use the “final individual distance” as the third indicator to show how far the
non-dominated points on the resulting final Pareto front $PF_{final}$ are from the true Pareto front $PF_{true}$,
where $PF_{true}$ is known a priori for the test functions in this paper. This indicator was originally
introduced by Veldhuizen and Lamont [28], where the final individual distance $G$ is defined as


$$G = \frac{\sqrt{\sum_{i=1}^{m} d_i^{2}}}{m} \qquad (6)$$


where $m$ is the number of individuals in $PF_{final}$, and $d_i$ is the Euclidean distance between the $i$-th of these
individuals and the point on $PF_{true}$ that is closest to it. A result of $G = 0$ indicates the convergence
$PF_{final} = PF_{true}$; any other value indicates that $PF_{final}$ deviates from $PF_{true}$.
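
The distance indicator can be computed as in the sketch below. It assumes the common form of this metric, the square root of the summed squared nearest-neighbor distances divided by $m$, and it assumes that $PF_{true}$ is available as a finite sample of points; both the function name and the sampled representation are illustrative choices.

```python
import math


def final_individual_distance(pf_final, pf_true):
    """G = sqrt(sum of d_i^2) / m, with d_i the distance to the nearest true-front sample."""
    if not pf_final:
        raise ValueError("pf_final must contain at least one point")
    dists = [min(math.dist(p, q) for q in pf_true) for p in pf_final]
    return math.sqrt(sum(d * d for d in dists)) / len(dists)


# Hypothetical usage with a two-point sample of the true front.
pf_true = [(0.0, 1.0), (1.0, 0.0)]
g_exact = final_individual_distance([(0.0, 1.0), (1.0, 0.0)], pf_true)
g_off = final_individual_distance([(0.0, 2.0), (1.0, 0.0)], pf_true)
```

Because the true front is only sampled, the value returned is an upper estimate of the distance to the continuous front; a denser sample tightens it.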


Figure 5 Difference between $PF_{true}$ and $PF_{final}$


Moreover, to compare the dominance relationship between two populations produced by
two different MOEAs, the coverage of two sets ($C$ value) [1] is measured to show how the final
population of one algorithm dominates the final population of another algorithm. Function $C$ maps the
ordered pair $(X_i, X_j)$ to the interval [0, 1], where $X_i$ and $X_j$ denote the final populations produced by
algorithms $i$ and $j$, respectively. The value $C(X_i, X_j) = 1$ implies that all points in $X_j$ are dominated by or
equal to points in $X_i$. The opposite, $C(X_i, X_j) = 0$, represents the situation in which none of the points in
$X_j$ are covered by the set $X_i$. Note that both $C(X_i, X_j)$ and $C(X_j, X_i)$ need to be considered
independently, since they have distinct meanings.
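
As a concrete illustration of the coverage measure, the sketch below counts the fraction of points in $X_j$ that are dominated by or equal to at least one point in $X_i$. Minimization is assumed and populations are represented by lists of objective vectors; the function names are hypothetical.

```python
def weakly_dominates(a, b):
    """True if a is no worse than b in every objective (minimization); covers equality."""
    return all(x <= y for x, y in zip(a, b))


def coverage(x_i, x_j):
    """C(X_i, X_j): share of points in X_j covered by at least one point in X_i."""
    if not x_j:
        raise ValueError("x_j must contain at least one point")
    covered = sum(1 for b in x_j if any(weakly_dominates(a, b) for a in x_i))
    return covered / len(x_j)


# Hypothetical populations showing that the measure is not symmetric.
pop_a = [(1.0, 1.0)]
pop_b = [(2.0, 2.0), (0.0, 3.0)]
c_ab = coverage(pop_a, pop_b)
c_ba = coverage(pop_b, pop_a)
```

Here `c_ab` and `c_ba` differ and do not sum to one, which is why both directions have to be reported.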


Therefore, the four indicators describe the quality of the final result
of the selected MOEAs: the average individual rank value shows the dominance relationship between
different individuals, the average individual density value illustrates how well the population diversity is
preserved, the average individual distance measures the distance between $PF_{final}$ and $PF_{true}$, which reflects
the quality of the resulting Pareto front, and the $C$ value compares the domination relationship of a pair of
MOEAs. The values of the performance indicators generated at the final generation are illustrated by
Box plots to derive the statistical comparison results.
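
The first two indicators can be sketched as follows. The ranking follows the usual reading of Goldberg's pure Pareto ranking (rank 1 for the non-dominated points, which are then removed before rank 2 is assigned, and so on). The density of an individual is assumed here to be the number of individuals sharing its grid cell, with grid bounds taken from the population extremes; those two choices, and all names, are assumptions of this example rather than details given in the text.

```python
from collections import Counter


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def goldberg_ranks(points):
    """Peel off successive non-dominated fronts and number them 1, 2, 3, ..."""
    ranks = [0] * len(points)
    remaining = set(range(len(points)))
    rank = 1
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i]) for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def cell_densities(points, grids=20):
    """Density of each point = number of points in the same grid cell (assumed definition)."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]

    def cell(p):
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            k = int((p[d] - lows[d]) / span * grids) if span > 0 else 0
            idx.append(min(k, grids - 1))
        return tuple(idx)

    cells = [cell(p) for p in points]
    counts = Counter(cells)
    return [counts[c] for c in cells]


def mean(values):
    return sum(values) / len(values)
```

The average individual rank value and average individual density value are then `mean(goldberg_ranks(pop))` and `mean(cell_densities(pop))` for a final population `pop`.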
5. MOP Test Function Case Study


To examine the performance of the selected MOEAs and the proposed RDGA on test
functions with different Pareto front features, we explore four numerical test functions in the simulation
study. Function F1 is adapted from an existing MOP to create a discontinuous and concave Pareto front
[29]. Functions F2-1 and F2-2 are designed to explore local and global Pareto optimality caused by
objective functions and constraints, respectively. Function F3 has a high-dimensional decision space, while
function F4 involves a high-dimensional objective space. For a fair comparison, the stopping generation,
the chromosome length of each decision variable, the crossover rate and the mutation rate are chosen to be
10,000, 15, 0.7 and 0.1, respectively, for all population-based MOEAs considered. One-point crossover is
used for all the population-based MOEAs. In addition, we select (HIO)-PAES, and a bit-flip mutation rate of
$1/k$ is used for a chromosome of $k$ genes. The tournament size $t_{dom}$ defined in [10] is chosen to be 2.

5.1 F1 — MOP with discontinuous and concave Pareto front

The rationale for studying MOPs with discontinuous and concave Pareto fronts is that some
MOEAs using plain aggregating schemes have been shown to have difficulty finding the Pareto points
on the discontinuous and concave segments. The ability of an MOEA to find a nonconvex Pareto front is one of
the most important reasons for using EAs rather than traditional gradient-based or simplex-based algorithms
in multiobjective optimization.

In this paper, a modified version of Tanaka’s MOP [29] is chosen as the test function with a discontinuous
and concave Pareto front.


Minimize $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$, where
$$f_1(x_1, x_2) = x_1,$$
$$f_2(x_1, x_2) = x_2,$$
subject to
$$0 \le x_1, x_2 \le \pi, \qquad (x_1 - 0.5)^2 - 5(x_2 - 0.5)^2 \le 0,$$
$$-(x_1^2 + x_2^2) + 1 + 0.1\cos\left(16\arctan\frac{x_1}{x_2}\right) \le 0. \qquad (7)$$


Figure 6(a) Decision space and Pareto optimal set (b) Objective space and Pareto front of function F1

Indeed, the concave feature is created by the complicated constraints imposed in Equation (7). The
Pareto optimal set and the true Pareto front are the same for this problem since each objective is
equal to one decision variable. Figure 6(a) shows the Pareto optimal set, and Figure 6(b) shows the
corresponding Pareto front, which consists of five discontinuous segments, all of which are
concave. Figure 7(a) shows the true Pareto front and a randomly generated initial population. Using the
same initial population for all population-based MOEAs, Figures 7(b)-(f) present the resulting Pareto
fronts obtained by the five MOEAs. The Box plots for the average values of the three indicators over 50 runs
are illustrated in Figures 8(a), (b) and (c), respectively. The performance measures $C(X_i, X_j)$ for the
comparison sets between algorithms $i$ and $j$ are shown in Figure 9, where algorithms 1-5 represent MOGA,
NSGA-II, PAES, RDGA and SPEA II in alphabetical order, respectively.


Figure 7(a) Ideal Pareto front and a randomly generated initial population; (b)-(f) Pareto fronts
resulting from (b) MOGA; (c) NSGA-II; (d) PAES; (e) RDGA; and (f) SPEA II on test function F1


Figure 8 Box plots of (a) average individual rank value; (b) average individual density value; and (c)
average individual distance on test function F1

Comparing the resulting Pareto fronts and indicator values in Figures 7-9, we can see
that MOGA has the lowest performance in terms of all the indicator values, while the other four MOEAs
provide competitive results. In particular, RDGA produces more complete Pareto fronts than the other four
MOEAs, and it also provides the highest $C(X_4, X_{1-5})$ values, which means the solution set produced by
RDGA most likely dominates the solution sets produced by the other selected MOEAs.
However, it is worth mentioning that the solution set produced by RDGA also has relatively high density
and distance values, which can be explained by RDGA creating more Pareto points than the other MOEAs,
some of which are not true non-dominated points. This problem can be solved if we let RDGA
run longer than the predetermined 10,000 generations.


Figure 9 Box plots based on C measure on test function F1


5.2  F2 —  Local  and  global  Pareto  optimality 

Deb [30] proposed a multimodal two-objective optimization problem that possesses a local and a
global Pareto front. He suggested that MOEAs might have a strong tendency to converge to the local Pareto
front instead of the global Pareto front if a certain kind of initial population was used. However, he did not
elaborate on the details of the design procedure or on how to make the problem more challenging. Moreover,
further study is needed for the case in which the local optimality is caused by constraints instead of objective
functions, because the different mechanisms behind them may result in dissimilar effects.


5.2.1  F2-1 —  Local  optimality  resulted  by  objective  function 

A  two-variable,  two-objective  local-Pareto  testing  problem  with  a  local  Pareto  front  can  be 
designed  as: 


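Minimize $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$, where
$$f_1(x_1, x_2) = R(x_1, x_2),$$
$$f_2(x_1, x_2) = \frac{T(x_1, x_2)}{S(x_1, x_2)},$$
where
$$T(x_1, x_2) = A - p_1 \times e^{-\left(\frac{x_2 - y_1}{q_1}\right)^2} - p_2 \times e^{-\left(\frac{x_2 - y_2}{q_2}\right)^2},$$
subject to $C(x_1, x_2)$. $\qquad (8)$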
From Equation (8), we can see that in $T(x_1, x_2)$, parameter $A$ affects the lower bound of the feasible
solution space and the Pareto front; $p_1$ and $p_2$ determine the optimality of $y_1$ and $y_2$. If $p_1 > p_2$, $y_1$ will be
the global optimal point, and $y_2$ will be the local optimal point. Otherwise, $y_2$ will be the global optimum,
and $y_1$ will be the local optimum. Meanwhile, the deviation between $y_1$ and $y_2$ determines the size of
the gap between the local and global optima. Parameters $q_1$ and $q_2$ determine how sharp the curves around the
optimal points $y_1$ and $y_2$ will be. If $q_1 \ll q_2$, a global optimal point is created with a spike around $y_1$,
and the sharper the spike is, the thinner the global Pareto optimal set will be.
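
The roles of these parameters can be explored numerically with the sketch below. It assumes that $T$ is a constant minus two Gaussian-shaped wells in $x_2$, a form inferred from the parameter description rather than confirmed by the source, and the default values are illustrative choices that satisfy $p_1 > p_2$ and $q_1 \ll q_2$. The function name `t_value` is hypothetical.

```python
import math


def t_value(x2, A=2.0, p1=1.0, p2=0.5, y1=0.1, y2=0.8, q1=1e-4, q2=0.8):
    """Assumed two-well form: A - p1*exp(-((x2-y1)/q1)^2) - p2*exp(-((x2-y2)/q2)^2)."""
    well_1 = p1 * math.exp(-((x2 - y1) / q1) ** 2)  # narrow spike around y1
    well_2 = p2 * math.exp(-((x2 - y2) / q2) ** 2)  # broad basin around y2
    return A - well_1 - well_2


at_global = t_value(0.1)        # bottom of the narrow spike
near_global = t_value(0.09995)  # half a spike-width away from y1
at_local = t_value(0.8)         # bottom of the broad basin
```

With these defaults the value at $y_1$ is lower than at $y_2$ because $p_1 > p_2$, and moving only 0.00005 away from $y_1$ already raises $T$ noticeably, which illustrates how a small $q_1$ makes the global region very thin.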

A  test  function  F2-1  is  created  from  the  general  model  in  Equation  (8)  as: 


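Minimize $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$, where
$$f_1(x_1, x_2) = \sin\left(\frac{\pi}{2} x_1\right),$$
$$f_2(x_1, x_2) = \frac{\left(1 - e^{-\left(\frac{x_2 - 0.1}{0.0001}\right)^2}\right) + \left(1 - 0.5\,e^{-\left(\frac{x_2 - 0.8}{0.8}\right)^2}\right)}{\arctan(100 x_1)},$$
subject to $0 \le x_1, x_2 \le 1$. $\qquad (9)$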


In Equation (9), there are two optimal values for $x_2$, $x_{2global} = 0.1$ and $x_{2local} = 0.8$, which are the
global optimum and the local optimum of $f_2(x_1, x_2)$, respectively. This effect constructs the final local
and global Pareto fronts shown in Figure 10(b), with a sampling rate equal to 0.01 for both decision
variables. The true (global) Pareto front is a very thin curve, which is separated from the major range that
contains the local Pareto front.


Figure 10(a) Decision space and Pareto optimal set (b) Objective space and local and global Pareto
fronts of function F2-1

Figure 10(a) shows the decision space and the local and global Pareto optimal sets, while Figure 10(b)
shows the objective space and the local and global Pareto fronts for the test function F2-1. Figures 11(b)-(f)
show the resulting Pareto fronts obtained by the five chosen MOEAs for a randomly generated initial
population, which is plotted in Figure 11(a) together with the true Pareto front. The Box plots for the average
values of the three indicators over 50 runs are illustrated in Figures 12(a), (b) and (c), respectively. The
performance measures $C(X_i, X_j)$ for the comparison sets between algorithms $i$ and $j$ are shown in Figure 13,
where algorithms 1-5 represent MOGA, NSGA-II, PAES, RDGA and SPEA II in alphabetical order, respectively.


Figure 11(a) Ideal Pareto front and a randomly generated initial population; (b)-(f) Pareto fronts
resulting from (b) MOGA; (c) NSGA-II; (d) PAES; (e) RDGA; and (f) SPEA II on test function F2-1


Figure 12 Box plots of (a) average individual rank value; (b) average individual density value; and (c)
average individual distance on test function F2-1


From Figure 13, we can see that RDGA and SPEA II provide the best results. In particular,
RDGA’s lowest $C$ value is greater than 0.8, which means most of the solutions produced by the other four
MOEAs are dominated by or equal to the solutions found by RDGA. Moreover, RDGA produces the lowest rank
and distance values. The highest density values, generated by RDGA and SPEA II, are caused by the partially
local and partially global Pareto fronts shown in Figures 11(e) and (f), which may result in a very crowded
partial global segment. From Figure 11, it is clear that the resulting Pareto front can be purely global, purely
local, or partially local and partially global. Indeed, the shapes of the resulting Pareto fronts depend
significantly on the type of initial population for this test function. Therefore, two sets of initial
populations are used for comparison. Set 1 includes 50 initial populations in which no individual
belongs to the global Pareto front. In set 2, at least one individual is located on the global Pareto front in
each of the 50 initial populations.

Figure 13 Box plots based on C measure on test function F2-1
Table 1 Final simulation results for function F2-1 by five MOEAs using initial population set 1

| Algorithm | Number of runs | Stop generation | Final average individual rank value | Final average individual density value | Final average generation distance | Number of runs producing pure global Pareto front | Number of runs producing local Pareto front* | Number of runs producing partial global Pareto front |
|---|---|---|---|---|---|---|---|---|
| MOGA | 50 | 10,000 | 1.02 | 3.21 | 0.59 | | 49 | 1 |
| NSGA-II | 50 | 10,000 | | 5.03 | 0.51 | | 45 | |
| PAES | 50 | 10,000 | | 3.54 | 0.55 | | 49 | |
| RDGA | 50 | 10,000 | | 6.15 | 0.43 | | 40 | |
| SPEA II | 50 | 10,000 | 1.01 | 5.32 | 0.46 | | 42 | |

Table 2 Final simulation results for function F2-1 by five MOEAs using initial population set 2

| Algorithm | Number of runs | Stop generation | Final average individual rank value | Final average individual density value | Final average generation distance | Number of runs producing pure global Pareto front | Number of runs producing pure local Pareto front* | Number of runs producing partial global Pareto front |
|---|---|---|---|---|---|---|---|---|
| MOGA | 50 | 10,000 | 1.03 | 3.74 | 0.14 | 37 | 0 | 13 |
| NSGA-II | 50 | 10,000 | 1.03 | 3.30 | 0.05 | 45 | 0 | 5 |
| PAES | 50 | 10,000 | 1 | 4.05 | 0.09 | 41 | 0 | 9 |
| RDGA | 50 | 10,000 | 1.12 | 3.44 | 0.07 | 44 | | 6 |
| SPEA II | 50 | 10,000 | 1.15 | 3.21 | 0.06 | 44 | 0 | 6 |

*Note: In Tables 1 and 2, we consider a pseudo-global Pareto front as a local Pareto front.
Tables 1 and 2 show the indicator values for set 1 and set 2, respectively. Comparing the
observations from Table 1 with those from Table 2, we can see that all of the selected MOEAs are very
sensitive to the initial population. When the initial population contains at least one individual that belongs
to the global Pareto front, there is a higher probability that the final population converges to the global
Pareto front; otherwise, it is most likely to converge to a local Pareto front. Moreover, different choices of the
parameters $A$, $p_1$, $p_2$, $q_1$, $q_2$, $y_1$ and $y_2$ produce various Pareto optimality characteristics. For
instance, Figures 14(a) and (b) show how parameters $q_1$ and $q_2$ affect the selected MOEAs in finding the
global Pareto front for initial population sets 1 and 2, respectively. When the ratio $q_2/q_1$ increases,
the percentage of the final population located on the global Pareto front decreases correspondingly.


Figure 14 Illustration of how the $q_2/q_1$ ratio affects MOEAs in finding the global Pareto front
(a) using initial population set 1 and (b) using initial population set 2


Figure 15 Pseudo-global Pareto fronts when $x_2$ approaches $x_{2global} = 0.1$ ($q_2/q_1$ ratio = 10,000)

Indeed, for $q_2/q_1 = 10,000$ given in Equation (9), the global Pareto optimal set is already very
thin, which means that only a very small deviation from $x_{2global} = 0.1$ still produces global Pareto
optimality. Even when $x_2$ takes a value very close to $x_{2global} = 0.1$, such as $x_2 = 0.09995$, the resulting
Pareto front will not be the global one, as shown in Figure 15. From Figure 15, we also see that the
gap between the local and global Pareto fronts is not empty. Some pseudo-global Pareto fronts emerge
when the $x_2$ value gets close to $x_{2global} = 0.1$. Therefore, instead of being trapped by the local Pareto
front, the resulting non-dominated points may be stuck on a pseudo-global Pareto front as well. This effect
increases when the ratio $q_2/q_1$ increases. In this scenario, although RDGA may perform better than the
other selected MOEAs on average, it will still be difficult to find the global Pareto front if none of the
individuals of the initial population is located exactly on the global Pareto front.


5.2.2  F2-2 — Local  optimality  resulted  by  constraint 


Applying constraints may also create similar local and global Pareto optimality.


In Equation (10), setting $q_1 = q_2$ implies that there will not be any spike in the function $T(x_1, x_2)$,
thus the search space will not be separated into two parts. Indeed, there is only one optimal point for
$T(x_1, x_2)$, at $x_2 = 0.28$. However, as we designed a new constraint for the decision variables in Equation
(10), we can still produce the similar local-global optimality results shown in Figure 16. In this scenario, the
global Pareto front and the local Pareto front still exist, except that they are created by a strict constraint.


Figure 16(a) Decision space and local and global Pareto optimal sets (b) Objective space and local and
global Pareto fronts of function F2-2


Under the same conditions, we run the four selected MOEAs and the proposed RDGA on initial
population sets 1 and 2 for comparison. Tables 3 and 4 show the indicator values for set 1 and set 2,
respectively.


Comparing the indicator values in Tables 3 and 4 with those in Tables 1 and 2, we can see that for
the function F2-2, the global Pareto fronts created by adding constraints are easier for
MOEAs to find than those created by objective functions. This can be explained by the local
optimality represented in Equation (8) having multiple layers of pseudo-global Pareto fronts, each of which
contributes a new local Pareto front. In that case, instead of finding the global Pareto front, MOEAs are
easily trapped by a local or pseudo-global Pareto front. In contrast, the local optimality caused by
constraints does not include these pseudo-global Pareto fronts. The gap between the local and global Pareto
fronts is completely empty, which means the resulting non-dominated points are most likely located on
one of the two fronts, thus reducing the search complexity.


Table 3 Final simulation results for function F2-2 by five MOEAs using initial population set 1

| Algorithm | Number of runs | Stop generation | Final average individual rank value | Final average individual density value | Final average generation distance | Number of runs producing pure global Pareto front | Number of runs producing pure local Pareto front | Number of runs producing partial global Pareto front |
|---|---|---|---|---|---|---|---|---|
| MOGA | 50 | 10,000 | 1.21 | 3.33 | 0.32 | 4 | 18 | 28 |
| NSGA-II | 50 | 10,000 | 1 | 5.01 | 0.27 | 6 | 15 | 29 |
| PAES | 50 | 10,000 | 1 | 3.96 | 0.35 | 5 | 20 | 25 |
| RDGA | 50 | 10,000 | 1.13 | 5.61 | 0.22 | 9 | 13 | 28 |
| SPEA II | 50 | 10,000 | 1.08 | 5.05 | 0.24 | 10 | 15 | 25 |

Table 4 Final simulation results for function F2-2 by five MOEAs using initial population set 2

| Algorithm | Number of runs | Stop generation | Final average individual rank value | Final average individual density value | Final average generation distance | Number of runs producing pure global Pareto front | Number of runs producing pure local Pareto front | Number of runs producing partial global Pareto front |
|---|---|---|---|---|---|---|---|---|
| MOGA | 50 | 10,000 | 1.04 | 3.20 | 0.08 | 45 | | 5 |
| NSGA-II | 50 | 10,000 | 1 | 4.61 | | 48 | 0 | 2 |
| PAES | 50 | 10,000 | 1 | 3.83 | | 44 | 0 | 6 |
| RDGA | 50 | 10,000 | 1 | | | 48 | 0 | 2 |
| SPEA II | 50 | 10,000 | 1 | 4.52 | 0.02 | 49 | 0 | 1 |


For the local optimality created by Equation (10), the smaller the constraint range for
$x_{2global}$ ($0.0999 \le x_{2global} \le 0.1001$ in Equation (10)), the more difficult it is for MOEAs to find the
real Pareto front, because the global Pareto optimal set becomes a thinner band as the constraint range
shrinks.
5.3 F3 — MOP with high-dimensional decision space


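Minimize $f_1(x)$ and $f_2(x)$, where
$$f_1(x) = 1 - e^{-4x_1} \sin^{6}(6\pi x_1),$$
$$f_2(x) = g(x)\left[1 - \left(\frac{f_1(x)}{g(x)}\right)^{2}\right], \qquad g(x) = 1 + 4\left(\sum_{i=2}^{5} x_i / 4\right)^{0.25},$$
subject to $0 \le x_i \le 1$, $i = 1, \ldots, 5$. $\qquad (11)$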
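
For reference, the two objectives of F3 can be evaluated as in the sketch below. It assumes the reading of Equation (11) in which $f_2 = g\,[1 - (f_1/g)^2]$ and $g$ sums $x_2$ through $x_5$; since the printed equation is damaged in this copy, that reading should be treated as an assumption. The function name `f3_objectives` is hypothetical.

```python
import math


def f3_objectives(x):
    """Return (f1, f2) for a 5-variable decision vector with 0 <= x_i <= 1 (assumed form)."""
    if len(x) != 5:
        raise ValueError("F3 is defined here for exactly five decision variables")
    if any(xi < 0.0 or xi > 1.0 for xi in x):
        raise ValueError("every decision variable must lie in [0, 1]")
    x1 = x[0]
    f1 = 1.0 - math.exp(-4.0 * x1) * math.sin(6.0 * math.pi * x1) ** 6
    g = 1.0 + 4.0 * (sum(x[1:]) / 4.0) ** 0.25  # depends only on x2..x5
    f2 = g * (1.0 - (f1 / g) ** 2)
    return f1, f2


# g reaches its minimum value of 1 when x2..x5 are all zero.
on_front = f3_objectives([0.0, 0.0, 0.0, 0.0, 0.0])
off_front = f3_objectives([0.0, 1.0, 1.0, 1.0, 1.0])
```

Under this reading, points with $x_2 = \dots = x_5 = 0$ give $g = 1$, so larger values of the remaining variables can only push $f_2$ upward for a fixed $f_1$.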


This test function is proposed in [12] as an MOP with a high-dimensional decision space and a local
Pareto front in the objective space, as shown in Figure 17. Figures 18(b)-(f) show the resulting Pareto fronts
obtained by the five chosen MOEAs for a randomly generated initial population, which is plotted in
Figure 18(a) together with the true Pareto front. The Box plots for the average values of the three indicators
over 50 runs are illustrated in Figures 19(a), (b) and (c), respectively. The performance measures
$C(X_i, X_j)$ for the comparison sets between algorithms $i$ and $j$ are shown in Figure 20, where algorithms 1-5
represent MOGA, NSGA-II, PAES, RDGA and SPEA II in alphabetical order, respectively.


Figure  17  Objective  space  and  Pareto  front  of  F3 


Figure 18(a) Ideal Pareto front and a randomly generated initial population; (b)-(f) Pareto fronts
resulting from (b) MOGA; (c) NSGA-II; (d) PAES; (e) RDGA; and (f) SPEA II on test function F3


From Figures 18-20, it is clear that MOGA has great difficulty in finding the true Pareto front
of this MOP. On the other hand, NSGA-II, SPEA II and RDGA can always identify some points on the
global Pareto front. Moreover, compared with NSGA-II and SPEA II, RDGA has the lowest density value,
which means RDGA tends to produce a more homogeneously distributed Pareto front by minimizing each
individual’s density value independently.


Figure 19 Box plots of (a) average individual rank value; (b) average individual density value; and (c)
average individual distance on test function F3


Figure 20 Box plots based on C measure on test function F3


Originally designed by Viennet [31], this test function (F4) has been adopted by many researchers
because it provides three partially conflicting objective functions, as shown in Figure 21. Figures 22(b)-(f)
show the Pareto fronts obtained by the five MOEAs from a randomly generated initial population, which is
plotted in Figure 22(a) together with the true Pareto front. Box plots of the average values of the three
indicators over 50 runs are depicted in Figures 23(a), (b) and (c), respectively. The performance measures
C(X_i, X_j) for the comparison sets between algorithms i and j are shown in Figure 24, where algorithms 1-5
represent MOGA, NSGA-II, PAES, RDGA and SPEA II, respectively.


Figure 21 (a) Decision space and Pareto optimal set; (b) objective space and Pareto front of function F4

Figure 22 (a) Ideal Pareto front and a randomly generated initial population; (b)-(f) Pareto fronts
resulting from (b) MOGA; (c) NSGA-II; (d) PAES; (e) RDGA; and (f) SPEA II on test function F4

Figure 23 Box plots of (a) average individual rank value; (b) average individual density value; and (c)
average individual distance on test function F4


Indeed, test function F4 possesses several challenging characteristics, such as a high-dimensional
objective space, a discontinuous Pareto optimal set and several local minima in the objective functions.
From the resulting Pareto fronts and the box plots of the performance indicators in Figures 22-23, RDGA,
NSGA-II, PAES and SPEA II all show the ability to approximate the true Pareto front, and the
population-based MOEAs (i.e., RDGA, SPEA II and NSGA-II) provide higher C values, as shown in Figure 24.
Furthermore, RDGA produces the smallest average individual density value and distance value compared
with NSGA-II and SPEA II. Because RDGA converts the original objective space into a bi-objective
rank-density domain, it is less sensitive to the complexity of high-dimensional objective spaces.
Therefore, RDGA holds promise for solving these types of MOPs.

Figure 24 Box plots based on C measure on test function F4




Figure 25 Evolutionary trajectories of (a) rank; (b) density; and (c) distance values by selected MOEAs on
test function F4


As  shown  in  Figure  25(b)  and  (c),  although  NSGA-II  performs  worse  than  RDGA  and  SPEA  II  in 
terms  of  density  preservation  and  distance  minimization,  it  converges  relatively  fast  in  the  rank  domain 
(Figure  25(a)).  This  phenomenon  can  be  partially  credited  to  the  pure  Pareto  ranking  scheme  used  by 
NSGA-II,  which  will  not  be  affected  by  the  density  information  during  the  evolutionary  process.  However, 
fast  convergence  of  the  rank  value  does  not  assure  that  density  and  distance  values  will  converge  fast  as 
well, and vice versa. As shown in Figures 25(a)-(c), although RDGA converges much more slowly than the
other three population-based MOEAs in terms of the rank indicator, it has the fastest convergence speed in
terms of the distance indicator compared with all the other selected MOEAs. This effect can be explained by
the restricted mating method and the “forbidden region” scheme applied by RDGA. On one hand, instead of
using the roulette wheel or tournament selection scheme, RDGA randomly selects an individual as one of
the parents to mate with the best individual located in the neighboring cells, which ensures that the worst
individuals have the same probability as the elitists of being selected and updated by their better fitted
offspring. Although this strategy may sacrifice the convergence speed of an elitist in finding a single true
non-dominated point, it offers poorly performing individuals a fair chance to catch up with the better ones
and draws the entire population toward the true Pareto front. On the other hand, the “forbidden region”
concept prevents an individual from being led in the wrong direction when the density subpopulation is
evolved. In this case, a newly generated offspring survives not only because it has a lower density value
than its corresponding parent, but also because it has an equal or higher rank value compared with the
selected parent. For
this  reason,  as  an  extra  constraint  of  RDGA,  the  “forbidden  region”  concept  also  helps  to  compress  the 
entire population and push it closer to the true Pareto front. Therefore, both the “restricted mating” and
“forbidden region” techniques contribute to the low variance and fast convergence of the average individual
distance value, as shown in Figure 23(c) and Figure 25(c). Note that the effects of these two techniques are
particularly significant for function F4, which may easily produce an extremely high variance of the
distance value during the evolutionary process if an ill-performed individual has never been updated since
the beginning. In addition, it is worth noting that PAES is not a population-based algorithm and that only
non-dominated individuals are stored in the archive at each generation. These characteristics distinguish
PAES from other MOEAs mainly in two aspects: its initial rank and density values are always equal to one,
and the average individual rank value remains one during the entire evolutionary process. From the
simulation  study,  although  PAES  outperforms  MOGA  for  all  the  test  functions,  it  cannot  provide 
competitive  results  compared  with  the  other  two  most  advanced  MOEAs  (i.e.,  NSGA-II  and  SPEA  II)  and 
the  proposed  RDGA  in  terms  of  rank,  density,  distance  indicators  and  C  measure. 

6.  Conclusions 

In this paper, a new multiobjective evolutionary algorithm, the Rank-Density-based Genetic
Algorithm (RDGA), is proposed. RDGA can be characterized as a) simplifying the problem domain by
converting high-dimensional multiple objectives into two objectives, namely minimizing the individual rank
value and the population density value; b) searching for and keeping better-approximated Pareto points by
diffusion and elitism schemes; c) incorporating density information into the Pareto ranking strategy; and
d) preventing harmful individuals by introducing a “forbidden region” concept. From the results presented
above, RDGA has shown its potential in producing statistically competitive results with four
state-of-the-art MOEAs (Fonseca’s MOGA, PAES, NSGA-II and SPEA II) on four types of multiobjective
optimization problems, which are designed to exploit various complications in finding near-optimal,
near-complete and uniformly distributed true Pareto fronts.

For the MOP test functions that only possess discontinuous or concave Pareto fronts, the recently
developed approaches (RDGA, NSGA-II, PAES and SPEA II) do not have much trouble finding some
points of the true Pareto front, and RDGA is found to perform better in keeping the diversity of
the individuals along the current trade-off surface, extending the Pareto front to new areas, and finding a
well-approximated, non-dominated set. However, without cautious selection of an initial population, an
MOP with local optimality will easily cause most MOEAs difficulty in finding a pure global
Pareto front. A local Pareto front created by constraints may cause less difficulty than one generated
by objective functions if the former does not contain pseudo-global Pareto fronts. In addition, two
complicated MOP test functions with high-dimensional decision and objective spaces are examined by
RDGA and the selected MOEAs. The experimental results demonstrate that RDGA produces statistically
competitive results with other representative Pareto-based MOEAs in finding a near-optimal,
near-complete and uniformly distributed Pareto front. Furthermore, as the test functions used in this
paper are still far from embodying a complete MOP test suite, a more thorough study to develop a general
model of MOEAs and to design a more representative test function set is necessary in future work.


References 

[1]  E.  Zitzler  and  L.  Thiele,  “Multiobjective  evolutionary  algorithms:  a  comparative  case  study  and 
the strength Pareto approach,” IEEE Trans. Evol. Comput., vol. 3, pp. 257-271, Nov. 1999.

[2]  C.  L.  Hwang  and  A.  S.  M.  Masud,  Multiple  Objective  Decision  Making— Methods  and 
Applications, Vol. 164 of Lecture Notes in Economics and Mathematical Systems. Berlin,
Germany: Springer-Verlag, 1979.

[3]  C.  M.  Fonseca  and  P.  J.  Fleming,  “An  overview  of  evolutionary  algorithms  in  multiobjective 
optimization,” Evol. Comput., vol. 3, pp. 1-16, 1995.

[4]  S.  W.  Mahfoud,  “Genetic  drift  in  sharing  methods,”  in  Proc.  1st  IEEE  Cong.  Evolutionary 
Computation,  vol.  1,  pp.  67-72,  1994. 

[5]  A.  Petrowski,  “A  clearing  procedure  as  a  niching  method  for  genetic  algorithms,”  in  Proc.  3rd 
IEEE  Cong.  Evolutionary  Computation,  pp.  798-803,  1996. 

[6] S. Ricardo and B. Amnon, “Evolutionary strategies for a parallel multi-objective genetic
algorithm,” in Proc. 9th Int. Conf. Genetic Algorithms, pp. 227-234, 2000.

[7]  D.  H.  Wolpert  and  W.  G.  Macready,  “No  free  lunch  theorems  for  optimization,”  IEEE  Trans. 
Evol. Comput., vol. 1, pp. 67-82, April 1997.

[8]  D.  A.  Van  Veldhuizen  and  G.  B.  Lamont,  Multiobjective  Evolutionary  Algorithm  Research:  A 
History and Analysis, Technical Report TR-98-03, Air Force Institute of Technology, 1998.

[9]  C.  M.  Fonseca  and  P.  J.  Fleming,  “Multiobjective  optimization  and  multiple  constraint  handling 
with evolutionary algorithms - part I: A unified formulation,” IEEE Trans. Systems, Man, and
Cybernetics, vol. 28, pp. 26-37, Jan. 1998.

[10] J. D. Knowles and D. W. Corne, “Approximating the non-dominated front using the Pareto
archived evolutionary strategy,” Evol. Comput., vol. 8, pp. 149-172, 2000.

[11]  E.  Zitzler,  M.  Laumanns  and  L.  Thiele,  SPEA2:  Improving  the  Strength  Pareto  Evolutionary 
Algorithm, Technical Report TIK-Report 103, Swiss Federal Institute of Technology, 2001.

[12] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, “A fast elitist non-dominated sorting genetic
algorithm for multi-objective optimization: NSGA-II,” in Proc. Conf. Parallel Problem Solving
from Nature VI, pp. 849-858, 2000.

[13]  N.  Srinivas  and  K.  Deb,  “Multi-Objective  function  optimization  using  non-dominated  sorting 
genetic algorithms,” Evol. Comput., vol. 2, pp. 221-248, 1994.




[14] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning. Reading,
MA:  Addison-Wesley,  1989. 

[15]  J.  D.  Schaffer,  “Multiple  objective  optimization  with  vector  evaluated  genetic  algorithms,”  in 
Proc. 1st Int. Conf. Genetic Algorithms, pp. 93-100, 1985.

[16]  M.  Valenzuela-Rendon  and  E.  Uresti-Charre,  “A  non-generational  genetic  algorithm  for 
multiobjective  optimization,”  in  Proc.  7th  Int.  Conf.  Genetic  Algorithms,  pp.  658-665,  1997. 

[17]  C.  C.  H.  Borges  and  H.  J.  C.  Barbosa,  “A  non-generational  genetic  algorithm  for  multiobjective 
optimization,” in Proc. 7th IEEE Cong. Evolutionary Computation, pp. 172-179, 2000.

[18] J. Horn, N. Nafpliotis and D. E. Goldberg, “A niched Pareto genetic algorithm for multiobjective
optimization,” in Proc. 1st IEEE Cong. Evolutionary Computation, vol. 1, pp. 82-87, 1994.

[19] T. Krink and R. K. Ursem, “Parameter control using agent based patchwork model,” in Proc. 7th
IEEE Cong. Evolutionary Computation, pp. 77-83, 2000.

[20] M. Laumanns, E. Zitzler, and L. Thiele, “A unified model for multi-objective evolutionary
algorithms  with  elitism,”  in  Proc.  7th  IEEE  Cong.  Evolutionary  Computation,  pp.  46-53,  2000. 

[21]  D.  A.  Van  Veldhuizen,  Multiobjective  Evolutionary  Algorithms:  Classifications,  Analyses,  and 
New  Innovations.  PhD  thesis,  Department  of  Electrical  and  Computer  Engineering,  Air  Force 
Institute  of  Technology,  Wright-Patterson  AFB,  Ohio,  May  1999. 

[22] K. A. De Jong, An Analysis of Behavior of a Class of Genetic Adaptive Systems, PhD dissertation,
Department  of  Computer  Science,  The  University  of  Michigan,  Ann  Arbor  MI,  1975. 

[23]  Z.  Michalewicz,  “Genetic  algorithms,  numerical  optimization,  and  constraints,”  in  Proc.  6th  Int. 
Conf. Genetic Algorithms, pp. 151-158, 1995.

[24]  C.  M.  Fonseca  and  P.  J.  Fleming,  “On  the  performance  assessment  and  comparison  of  stochastic 
multiobjective optimizers,” in Proc. Conf. Parallel Problem Solving from Nature (PPSN IV), pp.
584-593, 1996.

[25]  D.  A.  Van  Veldhuizen  and  G.  B.  Lamont,  “Multiobjective  evolutionary  algorithm  test  suite,”  in 
Proc.  ACM Symp.  on  Applied  Computing,  pp.  351-357, 1999. 

[26]  E.  Zitzler,  K.  Deb,  and  L.  Thiele,  “Comparison  of  multiobjective  evolutionary  algorithms: 
Empirical  results,”  Evol.  Comput.,  vol.  8,  pp.  173-195,  2000. 

[27] J. D. Knowles and D. W. Corne, “Benchmark problem generators and results for the
multiobjective degree-constraint minimum spanning tree problem,” in Proc. Conf. Genetic and
Evolutionary Computation (GECCO-2001), pp. 424-431, 2001.

[28]  D.  A.  Van  Veldhuizen  and  G.  B.  Lamont,  “On  measuring  multiobjective  evolutionary  algorithm 
performance,”  in  Proc.  7th  IEEE  Cong.  Evolutionary  Computation,  pp.  204-211,  2000. 

[29]  M.  Tanaka,  “GA-based  decision  support  system  for  multicriteria  optimization,”  in  Proc.  Int.  Conf. 
Systems, Man, and Cybernetics, pp. 1556-1561, 1995.

[30] K. Deb, “Multiobjective genetic algorithms: problem difficulties and construction of test
problems,” Evol. Comput., vol. 7, pp. 205-230, 1999.

[31]  R.  Viennet,  “Multicriteria  optimization  using  a  genetic  algorithm  for  determining  a  Pareto  front,” 
International  Journal  of  Systems  Science,  vol.  2,  pp.  255-260,  1996. 




APPENDIX  J: 


Automatic  Frog  Calls  Monitoring  System: 

A  Machine  Learning  Approach 

by 

Gary  G.  Yen  and  Qiang  Fu 

International Journal of Computational Intelligence and Applications, 1(2), 2001, pp. 165-186


International  Journal  of  Computational  Intelligence  and  Applications 
Vol.  1,  No.  2  (2001)  165-186 
©  Imperial  College  Press 


AUTOMATIC  FROG  CALLS  MONITORING  SYSTEM: 
A  MACHINE  LEARNING  APPROACH 


GARY  G.  YEN  and  QIANG  FU 

Intelligent  Systems  and  Control  Laboratory 
School  of  Electrical  and  Computer  Engineering 
Oklahoma  State  University,  Stillwater,  OK  74078,  USA 

Received  25  September  2000 
Revised  27  February  2001 


Automatic  recognition  of  frog  vocalization  is  considered  a  valuable  tool  for  a  variety 
of  biological  research  and  environmental  monitoring  applications.  In  this  research  an 
automatic  monitoring  system,  which  can  recognize  the  vocalizations  of  four  species  of 
frogs  and  can  identify  different  individuals  within  the  species  of  interest,  is  proposed. 
For  the  desired  monitoring  system,  species  identification  is  performed  first  with  the 
proposed  filtering  and  grouping  algorithm.  Individual  identification,  which  can  estimate 
frog  population  within  the  specific  species,  is  performed  in  the  second  stage.  Digital 
signal  pre-processing,  feature  extraction,  dimensionality  reduction,  and  neural  network 
pattern  classification  are  performed  step  by  step  in  this  stage.  Wavelet  Packet  feature 
extraction  together  with  two  different  dimension  reduction  algorithms  are  synergistically 
integrated  to  produce  final  feature  vectors,  which  are  to  be  fed  into  a  neural  network 
classifier.  The  simulation  results  show  the  promising  future  of  deploying  an  array  of 
continuous,  on-line  environmental  monitoring  systems  based  upon  nonintrusive  analysis 
of  animal  calls. 

Keywords: Frog calls monitoring, feature extraction, dimensionality reduction, pattern
classification.


1.  Introduction 

Recently, there has been increasing interest and expenditure in environmental monitoring,
both in North America and around the world. It is becoming essential to
predict and assess the environmental impact of human activities on plants and animals.
The populations of certain kinds of animals, such as birds and frogs, are excellent
indicators of overall environmental health. As many of the animals in an area may
be heard but not seen, it is convenient to rely on their sounds as a means of identification.
In many places a manual census is hardly feasible, if not completely impossible.
As a result, automatic recognition of animal sounds is considered a valuable tool
for biological research and environmental monitoring applications.

The frog is a small, tailless animal. Most frogs have moist skin. They typically
live both on land and in water. Toads are very similar to frogs except that
toads typically have rough, dry skin and often live in drier habitats. Frogs
and toads are commonly acknowledged as the major divisions of amphibians.
Amphibians lead a double life, alternately on land and in water. Typical amphibians
include toads, newts, and salamanders as well as frogs. They usually live in temporary
or permanent wetland areas. Frogs are of great importance to humans. They
are carnivores and consume large quantities of insects, worms, and other small creatures.
In turn, they may be a food source for other animals such as snakes. Frogs
and toads are integral parts of the food web. Many researchers in different fields are
interested in frogs and toads because they are considered to be bioindicators. The
health of frog populations is thought to reflect the health of the whole ecosystem.
Since the early 1980s, scientists have reported startling declines in the populations of
some species of frogs.1 These declines have occurred globally. Although the reason
for the population decline remains a mystery, a variety of hypotheses and possible
explanations have arisen, such as fluctuations caused partly by climatological
changes, increased ultraviolet light due to anthropogenically caused ozone depletion,
diseases, and the introduction of exotic species (e.g. bullfrogs). All of the above
make frog populations an excellent indicator of environmental health,
particularly in aquatic habitats, because of their biphasic (aquatic and terrestrial)
life. Causes for frog population declines remain shrouded in mystery despite increased
worldwide research efforts. Because of the difficulty and expense of
censusing populations of specific frog species, a conclusive analysis based on
estimates of frog populations is not yet available. Manual field tracing of frog calls
in extremely hot, high-humidity wetlands for an extended period of time is very
difficult. In addition, the calling activities of most species are irregular, depending
primarily on rainfall and season. As a result, short field trips to these areas are not
a reliable method to census the frog populations.

Frogs,  as  well  as  birds  and  whales,  have  developed  the  use  of  sound  as  the 
principal  means  of  distant  communication.  Most  species  of  frogs  can  produce  two 
types  of  calls,  a  distress  call  and  an  advertisement  call.  Both  males  and  females  can 
make  distress  calls  when  they  are  in  danger.  Only  males  can  produce  advertisement 
calls,  which  are  used  to  convey  such  information  as  location  and  breeding  readiness 
to both sexes. Advertisement calls can be used to identify the species of frogs.
Auditory  recognition  of  frogs  is  one  feasible  way  to  estimate  frog  population  in 
the  area  of  interest.  Therefore,  an  automatic  monitoring  system,  which  remotely 
monitors  calling  anurans  in  the  harsh  environment,  needs  to  be  established  with 
in-place  environmental  monitoring  infrastructure,  such  as  Mesonet.2 

The monitoring system, which does not require expert attendance, deploys
a directional microphone to capture frog calls continuously in the field, records
sound signals onto digital audio tapes, and translates them into digital audio data
files. Species identification and individual identification within some species can
be carried out automatically in this monitoring system. Useful information, such
as the number of species identified and their approximate populations, is then
transmitted via an environmental monitoring network for follow-up decision-making.
The successful development of this automatic monitoring system will provide a
robust measurement to quantify environmental pollution. This system will greatly
facilitate research to monitor the amphibian population as an indicator for environmental
and water quality. Figure 1 depicts the architecture of the system
envisioned.

Fig. 1. Automatic frog call monitoring system architecture.

By developing such an automatic monitoring system, which can withstand the
harsh environment in the field for a long period of time, it becomes possible to
continuously monitor frog calls and to reliably estimate the trend of the frog population
within specific species of interest. The remainder of the paper is organized
as follows. Section 2 provides a literature review of various representative animal
sound identification systems. Section 3 presents the proposed method of species
identification, which is proven to be efficient and more advanced than those mentioned
in Sec. 2. Section 4 shows how individual identification works, which includes three
major parts: signal preprocessing, feature vector dimension reduction, and pattern
classification. Section 5 discusses the simulation results of species and individual
identification based on the available data sets. Section 6 draws the conclusions of
the current research work.


2. Literature Review

In the scarce literature, most research on automatic recognition of animal
vocalizations has focused on species identification, i.e. identifying different species
of animals according to recorded vocalizations. Only a few research efforts have
been dedicated to quantifying the repertoire of a single species and thus estimating
the population within the same species.6 Based on the different characteristics of
various animal sounds, there are different methods to perform automatic or manual
recognition. Basically all recognition systems include two stages: digital signal
pre-processing and pattern classification. These two stages are discussed separately
as follows.

2.1. Methods of digital signal pre-processing

The purpose of digital signal pre-processing is to extract a temporal measurement
which contains useful information from the original data. As is well known, these large
volumes of original data generally contain only sparse segments of useful calls.
These calls often have weak signal strength and may be buried in interference,
which usually consists of somewhat similar noises from other animals and the environment.
Through pre-processing, we can extract only those useful signals
critical for pattern recognition.

As with human speech, animal sounds can be sensibly interpreted using a
time-frequency representation. Thus, tools designed for human speech analysis
are commonly used for animal sound classification. Generally these include
time domain methods such as linear predictive coding,3,4 frequency domain methods
such as the Fourier transform, time-frequency domain methods such as time dependent
Fourier transforms and the spectrogram,5,6 and time-scale domain methods such as wavelet
transforms.7 In addition, biologists have considered zero-crossing analysis,
autocorrelation functions, cepstral analysis, power spectral density (Welch method), and
Wigner-Ville transforms as tools for signal pre-processing. In comparison with
human speech, animal sounds are usually simpler to recognize than speech utterances.
Their recognition would be an easy problem if it were
conducted under conditions similar to those of most successfully deployed speech
recognition systems: a single cooperative individual close to the microphone in a
quiet environment.8 Unfortunately, animal sounds are usually recorded in a
much noisier environment, which means that we must recognize simpler vocalizations
under much more difficult conditions. That is the main difference between
human speech recognition and animal sound recognition. Work in the former case
focuses on the utterances, while the latter focuses on robustly handling the recording
conditions.

The noisy nature of animal sound identification leads to one conclusion: all the
signal processing methods mentioned above provide only the necessary tools for
signal pre-processing. Much work remains to be done to
extract the unique features from raw signals.

High-level environmental noise may make the identification process considerably
more complicated than in simulations conducted in laboratory environments. How to realize
a noise cancellation or noise reduction scheme, especially for noise in a frequency
range similar to that of frog calls, remains a challenging issue.

2.2.  Methods  of  pattern  classification 

Pattern classification forms a fundamental solution to different problems in real
world applications. The function of pattern classification is to categorize an unknown
pattern into a distinct class based on a suitable similarity measure. Thus,
similar patterns are assigned to the same class while dissimilar patterns are classified
into different classes.

Engineers and scientists have developed many methodologies to deal with
classification problems. In animal sound identification systems, several methods
have been used. The most popular are the neural network classifier,0,9
decision-tree classifier,6 Bayesian classifier4 and statistical pattern classifier.9,10
Some researchers have combined more than one method in their classifier designs.
Most existing classifiers can only recognize already known classes and
demand an iterative design process with a long training time. In the near future, the
proposed frog call monitoring system is required to be placed in the field for real-time
monitoring. It will not have any prior knowledge about any individual frog call pattern,
nor will it know how many individuals are calling during the recording
period. In this case, the classifier must be built with an on-line learning ability that
can learn new patterns in real time and continuously grow the number of identified
individuals without re-training. The popular multi-layer perceptron network
clearly cannot meet this specification. The Incremental Learning Fuzzy Neural
Network proposed previously18,19 can address this deficiency. It uses an incremental learning
algorithm and can detect new classes of signatures and update its parameters while
in operating mode. It also has an on-line (real-time), fast learning algorithm
that requires no a priori information.
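The on-line requirement described above (assign a pattern to an existing class by a similarity measure, or open a new class when nothing is similar enough) can be illustrated with a deliberately simple nearest-prototype scheme. This is only a minimal sketch under stated assumptions and is not the Incremental Learning Fuzzy Neural Network itself: the class name, the Euclidean distance measure, the novelty threshold and the learning rate are all hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


class IncrementalPrototypeClassifier:
    """Toy on-line classifier that adds a class whenever a pattern looks novel."""

    def __init__(self, novelty_threshold: float, learning_rate: float = 0.1) -> None:
        self.novelty_threshold = novelty_threshold  # assumed tuning parameter
        self.learning_rate = learning_rate          # assumed tuning parameter
        self.prototypes: List[List[float]] = []     # one prototype per known class

    def observe(self, features: Sequence[float]) -> int:
        """Classify one feature vector, learn from it, and return its class index."""
        best_index, best_distance = -1, math.inf
        for index, prototype in enumerate(self.prototypes):
            distance = math.dist(prototype, features)  # Euclidean similarity measure
            if distance < best_distance:
                best_index, best_distance = index, distance
        if best_distance > self.novelty_threshold:
            # Nothing is similar enough: create a new class without re-training.
            self.prototypes.append(list(features))
            return len(self.prototypes) - 1
        # Otherwise move the winning prototype slightly toward the new pattern.
        winner = self.prototypes[best_index]
        for k, value in enumerate(features):
            winner[k] += self.learning_rate * (value - winner[k])
        return best_index


# Hypothetical two-dimensional feature vectors from two calling individuals.
classifier = IncrementalPrototypeClassifier(novelty_threshold=1.0)
stream = [(0.0, 0.0), (0.1, -0.1), (5.0, 5.0), (5.2, 4.9), (0.05, 0.0)]
labels = [classifier.observe(vector) for vector in stream]
```

In this sketch the number of classes is not fixed in advance; it grows as new patterns arrive, which is the property that a batch-trained multi-layer perceptron lacks.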

3.  Frog  Species  Identification 

3.1. Data acquisition of frog call signals

The proposed frog call monitoring system is based upon in-field acquisition of natural
sound signals. A directional microphone, the Telinga Pro V Mono Parabolic microphone,
was used to collect audio signals. Its frequency response is 40-18,000 Hz,
+/- 3 dB. The microphone was connected to a SONY PCM-M1 digital audio
recorder. Audio signals were stored on digital audiotape (DAT). Each DAT has a
capacity of 120 minutes. The Turtle Beach System, which includes a Turtle Beach
Montego II sound card mounted in the PC, Turtle Beach AudioStation, Turtle
Beach AudioView and additional supporting software, was used to transfer the
signals stored on DAT to the computer in digitized WAVE format. Each WAVE
file is in PCM (Pulse Code Modulation) format with an 8000 Hz sampling frequency,
16-bit resolution and a single (mono) channel.

The data rate of each WAVE file is 16 KByte per second. Considering the computational
complexity and the characteristics of frog calls, each WAVE file was chosen to be
approximately 10 seconds long. To analyze data in real-time, the computer system should
carry out species identification and the corresponding individual identification within
10 seconds. A Pentium III 500 PC was later used for simulation purposes. It was found
that all numerical computations could be done within one or two seconds, well below
10 seconds. Hence, a file segment length of 10 seconds is practically reasonable. Once
the whole hardware system has been developed, sound signals can be separated into
10-second intervals automatically and analyzed one by one in real-time.
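
The data-rate figure quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch using Python's standard `wave` module; the function names and the use of a silent in-memory segment are illustrative assumptions, not part of the original acquisition software.

```python
import io
import wave

SAMPLE_RATE_HZ = 8000     # sampling frequency stated in the text
SAMPLE_WIDTH_BYTES = 2    # 16-bit PCM
CHANNELS = 1              # single (mono) channel
SEGMENT_SECONDS = 10      # nominal file length stated in the text


def bytes_per_second(rate_hz, width_bytes, channels):
    """Raw PCM data rate, ignoring the small WAVE header."""
    return rate_hz * width_bytes * channels


def make_silent_segment(seconds):
    """Write `seconds` of silence to an in-memory WAVE file and return its bytes."""
    n_bytes = bytes_per_second(SAMPLE_RATE_HZ, SAMPLE_WIDTH_BYTES, CHANNELS) * seconds
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(b"\x00" * n_bytes)
    return buffer.getvalue()


def read_frame_count(wave_bytes):
    """Return the number of sample frames stored in a WAVE byte string."""
    with wave.open(io.BytesIO(wave_bytes), "rb") as wav:
        return wav.getnframes()


rate = bytes_per_second(SAMPLE_RATE_HZ, SAMPLE_WIDTH_BYTES, CHANNELS)
segment = make_silent_segment(SEGMENT_SECONDS)
print("bytes per second:", rate)
print("frames in one segment:", read_frame_count(segment))
```

Under these settings one 10-second segment holds 80,000 samples, which is the data length that the later point counts (for example, 512-point windows) refer to.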

3.2. Spectrogram analysis

All frog call acquisition work was performed in Stillwater, Oklahoma. Frog calls
of four different species were collected from January to May 2000. These were
the southern leopard frog (RAUT), American toad (BUAM), Strecker's chorus frog
(PSST), and spotted chorus frog (PSCL). Advertisement frog calls are species-specific.
Early experiments have shown that a variety of properties of frog calls can be used
by a frog species to recognize the vocalizations of its own species.11 To identify
different species of frogs, suitable tools are needed to explore the properties of
different frog calls. The spectrogram, which is frequently used in speech analysis,
provides a three-dimensional representation of the sound intensity in different
frequency bands over time and thus can serve as a powerful tool for this analysis.

The spectrogram, essentially a type of TDFT, has two general classes: the wideband
spectrogram and the narrowband spectrogram. A wideband spectrogram results from a
window that is relatively short in time. It has poor resolution in the frequency
domain and good resolution in the time domain. A narrowband spectrogram, in contrast,
uses a longer window to provide higher frequency resolution, with a corresponding
decrease in time resolution.12 In this research, single pitches of each frog call need
to be separated for identification and analysis purposes. These single pitches usually
have very short durations (10-50 ms). As a result, the wideband spectrogram, which has
relatively good resolution in the time domain, is used here. Figure 2 shows typical
spectrograms of the four species being studied. All unique call patterns are within
the rectangular areas. Notice that different noises occupy nearly all frequency bands.
Each species has unique properties that distinguish it from the others. These include
the call rate, the call duration, the amplitude-time envelope, the waveform
periodicity, the pulse-repetition rate, the frequency modulation, the frequency, and
spectral patterns.11 Since all frog call signals occupy frequency bands well below
4 kHz, according to sampling theory, the sampling rate (Fs) of 8 kHz is a reasonable
choice. The calls of these four species all lie within certain frequency bands.
According to call patterns, they can be divided into three types. Type I includes RAUT
and PSCL. Usually, one frog of this type makes one to several calls at one time. Each
call is composed of several pitches (pulses) with a similar repetition rate. PSST can
be seen as Type II. Each call of PSST is composed of only one single pitch. Type II
can be considered a simpler case of Type I. Type III contains BUAM. This call usually
lasts for minutes and can be regarded as continuous. Each call of this type is
composed of many single pitches with a similar repetition rate. Table 1 describes some
typical features of these four different calls.
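
The trade-off between wideband and narrowband spectrograms can be made concrete with a short calculation. The sketch below assumes the 8 kHz sampling rate stated above and the 128-point window quoted in Sec. 3.3; the 1024-point window is a hypothetical narrowband setting chosen only for comparison.

```python
FS_HZ = 8000  # sampling rate stated in the text


def window_resolution(window_length, fs_hz=FS_HZ):
    """Return (time span in ms, frequency spacing in Hz) of an N-sample window."""
    time_ms = 1000.0 * window_length / fs_hz
    freq_hz = fs_hz / window_length
    return time_ms, freq_hz


# 128 points: the window length quoted in Sec. 3.3 (wideband case).
# 1024 points: a hypothetical longer window (narrowband case).
for n_points in (128, 1024):
    time_ms, freq_hz = window_resolution(n_points)
    print(f"N = {n_points}: about {time_ms} ms in time, about {freq_hz} Hz in frequency")
```

A 128-point window spans a time interval comparable to a single pitch, whereas the longer window would smear several pitches of a call together in time.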

Generally speaking, one single pitch is the basic element of all different frog calls.
Though each pitch may have one peak frequency value, the sound energy is basically
well distributed within the whole frequency band. Also, different individuals within
the same species may have slightly different peak frequency values. According to the
different call patterns shown in Fig. 2, two general methods can be used to carry out
species identification: one for Types I and II and the other for Type III. Since RAUT
and PSCL are similar and PSST is simpler than these two, the methods proposed for
detecting RAUT will be explained in detail as an example, and the methods for
identifying BUAM will also be illustrated.

Table 1. Call patterns of different species.

Species   Main Frequency   Number of Pitches   Pitch Repetition Rate   Duration of One
          Band (Hz)        Within One Call     Within One Call (s)     Single Pitch (s)
RAUT      800-2200         3 ~ 9               0.07 ~ 0.09             0.02 ~ 0.03
BUAM      1600-2000        Hundreds            0.04 ~ 0.05             0.03 ~ 0.04
PSST      2000-2500        1                   N/A                     0.04 ~ 0.05
PSCL      2300-3100        7 ~ 14              0.02 ~ 0.03             0.01 ~ 0.015

Fig. 3. Spectrogram of RAUT call (vertical axis: frequency, 0-4000 Hz).


3.3.  The  proposed  filtering  and  grouping  algorithm 

Before any identification work can be carried out, the characteristics of frog calls
must be carefully studied. Suitable features may then be extracted to implement the
species identification. Figure 3 shows some details of the spectrogram of two RAUT
calls. This spectrogram uses a window of length 128 and an FFT of length 256. Two
calls made by the same frog are shown in the shaded area. The frequency ranges from
800 Hz to 2400 Hz. Everything outside the shaded area is background noise of various
kinds. In this case, the noise level is high. Each call is composed of several
pitches, which can be seen as vertical striations in the shaded area. The lengths of
these two calls are T1 and T2, respectively. The pitch repetition rates within one
call are roughly the same. That is, t1 ≈ t2 ≈ t3. After carefully studying all
available RAUT call signals, we can make the following observations: (1) All RAUT
calls have most of their energy within 1000-2000 Hz; (2) All RAUT calls are composed
of several pitches; (3) The pitch repetition rate within one call is approximately
0.07-0.09 second; and (4) Single pitch length usually ranges from 0.02 to 0.03 second.

Based on these heuristics, we can use a simple but effective algorithm to identify
RAUT calls from the whole data set. We choose one period (roughly 3.5 seconds) of the
original sound signal Y(t) as an example. Since RAUT calls lie within a certain
frequency band, an IIR bandpass filter (BPF) is used to remove irrelevant noise outside
the specified frequency range. Chebyshev Type II filters are monotonic in the passband
and equiripple in the stopband, so they largely preserve the magnitude of signal
components within the passband and attenuate unwanted signal components to the same
level, whether they lie in the high frequency band or the low frequency band. For
these reasons, the Chebyshev Type II filter was chosen to design the BPF. The desired
passband is set from 1000 Hz to 2000 Hz. The parameters designed specifically for this
BPF are listed below:13

(1) Passband corner frequency Wp: [1050 Hz 1950 Hz]/(Fs/2) = [0.2625 0.4875].

(2) Stopband corner frequency Ws: [950 Hz 2050 Hz]/(Fs/2) = [0.2375 0.5125].

(3)  Passband  ripple  Rp  (the  maximum  permissible  passband  loss  in  decibels):  1  dB. 

(4)  Stopband  attenuation  Rs  (the  number  of  decibels  the  stopband  is  down  from 
the  passband):  50  dB. 
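
The four specifications above map directly onto a standard Chebyshev Type II design routine. The following is a minimal sketch that assumes SciPy's `scipy.signal.cheb2ord` and `scipy.signal.cheby2` as the design tools; the source does not state which software was used, and the helper names `design_bandpass` and `steady_state_gain` are hypothetical.

```python
import numpy as np
from scipy import signal

FS_HZ = 8000.0
NYQUIST_HZ = FS_HZ / 2.0

# Corner frequencies normalized by Fs/2, as in items (1) and (2) above.
WP = [1050.0 / NYQUIST_HZ, 1950.0 / NYQUIST_HZ]
WS = [950.0 / NYQUIST_HZ, 2050.0 / NYQUIST_HZ]
RP_DB = 1.0    # maximum passband loss, item (3)
RS_DB = 50.0   # minimum stopband attenuation, item (4)


def design_bandpass():
    """Find the lowest order meeting the specification and design the filter."""
    order, wn = signal.cheb2ord(WP, WS, RP_DB, RS_DB)
    # Second-order sections are used for numerical robustness (an assumption).
    sos = signal.cheby2(order, RS_DB, wn, btype="bandpass", output="sos")
    return order, wn, sos


def steady_state_gain(sos, tone_hz, seconds=1.0):
    """RMS gain for a pure tone, measured on the second half of the output."""
    t = np.arange(int(FS_HZ * seconds)) / FS_HZ
    tone = np.sin(2.0 * np.pi * tone_hz * t)
    filtered = signal.sosfilt(sos, tone)
    half = len(t) // 2
    rms_out = np.sqrt(np.mean(filtered[half:] ** 2))
    rms_in = np.sqrt(np.mean(tone[half:] ** 2))
    return float(rms_out / rms_in)


order, wn, sos = design_bandpass()
print("order:", order, "Wn:", wn)
print("gain at 1500 Hz:", steady_state_gain(sos, 1500.0))
print("gain at 500 Hz:", steady_state_gain(sos, 500.0))
```

Design tools differ in how they count the order of a bandpass filter, so the order reported by this sketch may not coincide with the figure quoted in the text.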

Based on these parameters, the lowest possible order of the Chebyshev Type II filter
is 12, and the Chebyshev Type II cutoff frequency, Wn, that allows it to achieve the
given specifications is [0.2472 0.5089]. The filtered signal after this BPF is denoted
by Ybp(t). Ybp(t) is then squared to get Ysqr(t). Using a threshold, the small values
of Ysqr(t) are zeroed out. The signal after thresholding is denoted by Ythres(t). The
threshold, M, is set to a positive number multiplied by the mean value of Ysqr(t),
which is proportional to the average power of Ybp(t). In this case, M is set to
10 x mean(Ysqr(t)). In Ythres(t), values below this threshold (small spikes) are set
to zero so as to further reduce the noise level.
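
The squaring and thresholding step can be sketched as follows. This is an illustration under the stated rule M = 10 x mean(Ysqr(t)); the function name `square_and_threshold` and the `factor` argument are hypothetical names introduced here.

```python
def square_and_threshold(filtered, factor=10.0):
    """Square the band-passed signal and zero every value below the threshold.

    The threshold is `factor` times the mean of the squared signal, which is
    proportional to the average power of the filtered signal.
    """
    squared = [value * value for value in filtered]
    threshold = factor * sum(squared) / len(squared)
    thresholded = [value if value >= threshold else 0.0 for value in squared]
    return thresholded, threshold


# Toy input: weak background samples plus one strong spike at the end.
y_bp = [0.1] * 99 + [5.0]
y_thres, m = square_and_threshold(y_bp)
print("threshold M:", m)
print("samples kept:", sum(1 for value in y_thres if value > 0.0))
```

Because the threshold scales with the average power of the segment, it adapts to the recording level, but a very loud noise burst in the same segment would also raise it.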

Single pitches of RAUT calls usually last 0.02 to 0.03 second, and the pitch repetition
rate within one call ranges from 0.07 second to 0.09 second. The grouping algorithm
tries to identify those spikes that tend to belong to one single call and group them
together, so that the time interval of each single call can be detected. First, by
gathering samples that lie very close to each other, the time intervals of possible
isolated spikes (pitches) are detected. Then, by checking the duration of each short
time interval, those that are too short or too long to be single pitches are thrown
away. The second step is to group those pitches that tend to belong to one call by
checking the intervals between adjacent pitches. Thus, possible time intervals of
single frog calls are acquired. Pitches too far apart cannot be clustered together
since they tend to belong to different calls. Finally, the lengths of possible single
calls are checked. Since one RAUT call is often composed of at least three strong
isolated pitches, those intervals that are not long enough are discarded. After these
three steps, real RAUT call intervals have been detected and isolated. Figure 4 shows
the result of two calls, along with their pitches, separated by the proposed grouping
algorithm. The first call contains eight pitches. It starts at point t = 4861 and ends
at point t = 9261, so the duration of this call is 0.55 second
((9261 - 4861)/8000 = 0.55). The second one contains six pitches. It starts at
t = 10686 and ends at t = 14013, so its duration is 0.416 second. The intervals of
these single pitches have also been stored for possible further analysis such as
individual identification. In Fig. 4, the darker the shaded area, the stronger the
single pitch is.

Fig. 4. Two intervals detected by grouping algorithm.
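
The three grouping steps can be sketched in a few lines. The version below is an illustration only: the numeric limits (`max_gap`, `min_len`, `max_len`, `max_pitch_spacing`, `min_pitches`) are hypothetical values chosen around the pitch durations and repetition rates quoted in the text, not the parameters used in the study.

```python
FS_HZ = 8000


def find_pitches(thresholded, max_gap=40, min_len=120, max_len=280):
    """Step 1: merge nearby nonzero samples into spikes, keep pitch-like durations."""
    spikes = []
    start = last = None
    for n, value in enumerate(thresholded):
        if value <= 0.0:
            continue
        if start is None:
            start = last = n
        elif n - last <= max_gap:
            last = n
        else:
            spikes.append((start, last))
            start = last = n
    if start is not None:
        spikes.append((start, last))
    return [(a, b) for a, b in spikes if min_len <= b - a + 1 <= max_len]


def group_calls(pitches, max_pitch_spacing=800, min_pitches=3):
    """Steps 2 and 3: cluster adjacent pitches into calls, drop calls that are too short."""
    calls, current = [], []
    for pitch in pitches:
        if current and pitch[0] - current[-1][0] > max_pitch_spacing:
            calls.append(current)
            current = []
        current.append(pitch)
    if current:
        calls.append(current)
    return [call for call in calls if len(call) >= min_pitches]


def call_duration_seconds(call):
    """Duration from the first sample of the first pitch to the last sample of the last."""
    return (call[-1][1] - call[0][0]) / FS_HZ


# Durations of the two calls reported in the text, from their end points.
print((9261 - 4861) / FS_HZ, (14013 - 10686) / FS_HZ)
```

Each pass visits every sample or pitch once, which is consistent with the low computational cost claimed for the method.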

Compared to other species identification methods proposed for marine mammals such as
whales and dolphins,14-16 we propose herein a simpler algorithm that requires much
less computation. For an incoming N-point data set, the computational complexity of
the proposed filtering and grouping algorithm is O(PN). All the methods listed in the
literature require further processing of the data with additional pattern
classification, whereas the method proposed here does not need this step at all. As
mentioned before, calls of Type II (PSST) are a simpler form of those of Type I. So
just by changing some parameters, this filtering and grouping algorithm may also be
used to identify Type II frog calls.

Since many species of frogs tend to make sounds in a similar pitch/pulse repetition
mode, it is convenient to use this filtering algorithm to identify these species.




In doing so, parameters such as the passband of the BPF and the criteria for pitch
duration and call length need to be changed accordingly. There is no need to use the
BPF and grouping algorithm to identify species with continuous calls like BUAM; the
fast Fourier transform (FFT) is enough to accomplish this task. The main frequency
band of BUAM calls is [1600 Hz 2000 Hz]. Usually, peak frequencies of BUAM calls are
within a range of [1650 Hz 1850 Hz]. The pitch repetition rate is roughly
0.04 ~ 0.05 second, that is, 320 ~ 400 points, and the duration of one single pitch is
usually 0.03 ~ 0.04 second, that is, 240 ~ 320 points. So a 512-point time interval at
any time within a call is sufficient to cover most of one single pitch. Specifically,
a 512-point data set was extracted every two seconds along the data file. For one data
file approximately 10 seconds long, a total of four or five data sets can be
extracted. An FFT was performed on each data set. Then the peak frequency within
[1000 Hz 2000 Hz] of each result was checked to see whether it also lies within
[1650 Hz 1850 Hz]. If so, a BUAM is taken to be calling in this period. Ideally, this
method is reliable at detecting continuous calls that have peak frequency values
within a narrow frequency band. But if there is some kind of continuous noise that
happens to have its peak frequency in the same frequency range, a mismatch is
unavoidable.
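
The peak-frequency check for continuous calls can be sketched as below. To stay free of external libraries, this illustration evaluates the discrete Fourier transform directly on the bins between 1000 Hz and 2000 Hz instead of calling an FFT routine; the resulting spectrum values are the same. The function names are hypothetical.

```python
import cmath
import math

FS_HZ = 8000
SEGMENT_POINTS = 512


def band_peak_frequency(segment, band_hz=(1000.0, 2000.0), fs_hz=FS_HZ):
    """Frequency (Hz) of the largest DFT magnitude inside `band_hz`."""
    n = len(segment)
    k_lo = math.ceil(band_hz[0] * n / fs_hz)
    k_hi = math.floor(band_hz[1] * n / fs_hz)
    best_k, best_mag = k_lo, -1.0
    for k in range(k_lo, k_hi + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * m / n)
                    for m, x in enumerate(segment))
        if abs(coeff) > best_mag:
            best_k, best_mag = k, abs(coeff)
    return best_k * fs_hz / n


def detect_buam(samples, step_seconds=2.0, peak_range_hz=(1650.0, 1850.0)):
    """Check one 512-point segment every `step_seconds`; True where the peak is in range."""
    step = int(step_seconds * FS_HZ)
    flags = []
    for start in range(0, len(samples) - SEGMENT_POINTS + 1, step):
        peak = band_peak_frequency(samples[start:start + SEGMENT_POINTS])
        flags.append(peak_range_hz[0] <= peak <= peak_range_hz[1])
    return flags


# Synthetic 10-second tone at 1750 Hz standing in for a continuous call.
tone = [math.sin(2.0 * math.pi * 1750.0 * m / FS_HZ) for m in range(10 * FS_HZ)]
print(detect_buam(tone))
```

With a 512-point segment the frequency bins are 15.625 Hz apart, so the 200 Hz acceptance range spans roughly thirteen bins.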


4.  Frog  Individual  Identification 
4.1.  Individual  identification  overview 

Individual identification refers to identifying individual frogs within the same
species and estimating the number of identified frogs. Not all species of frogs are
suitable for this task. The underlying assumption of automatic individual
identification is that human experts can distinguish different individuals within the
same species. If so, their knowledge can be used for analysis purposes and makes
automatic identification possible. If human experts cannot tell one call from another,
it is unlikely that a machine can tell the difference without a guideline. For
example, one BUAM frog usually makes calls continuously for several tens of seconds or
even minutes. There may or may not be other individuals calling at the same time, and
even an expert cannot tell the difference. In this case, without any specific sample
or prior knowledge, no baseline can be established. Among the four species of
interest, only RAUT individuals can be identified by the human ear, so our research
focuses on individual identification within this species only.

Before individual identification can be examined, typical samples of individual calls
must be collected. In the species identification stage, individual RAUT calls with
several pitches have been extracted, as seen in Fig. 4. Since one pitch is the basic
element of each RAUT call, we may choose one typical pitch as the sample for one RAUT
call. The duration of one single pitch ranges from 0.02 second to 0.03 second, that
is, a 160 ~ 240-point data segment. A finite duration window w[n] is applied to the
original signal Y[n] prior to any signal analysis. This produces the finite length
sequence v[n] = w[n]Y[n]. There are many kinds of windows in digital signal
processing, such as the Bartlett, Hamming, Hanning, Blackman, Kaiser, and rectangular
windows. Here, we choose one non-overlapping 512-point Hamming window. The center
point of the Hamming window was chosen to be at the maximum filtered value (Ybp(t))
within one call. Thus, the strongest pitch within one call is extracted and serves as
the sample vector for this frog call. For each RAUT call, we can extract one 512-point
data segment as its sample vector. All subsequent work is based on analysis of this
data segment.
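
The extraction of the sample vector can be sketched as follows. This is a minimal illustration: taking the maximum of the absolute filtered value and clamping the window to stay inside the record are assumptions made here, and the function names are hypothetical.

```python
import math

WINDOW_POINTS = 512


def hamming(n_points):
    """Standard Hamming window, w[n] = 0.54 - 0.46 cos(2 pi n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def extract_sample_vector(original, filtered, call_start, call_end,
                          n_points=WINDOW_POINTS):
    """Window `original` around the strongest filtered sample of one call.

    Returns the start index of the window and the windowed segment w[n] * Y[n].
    """
    center = max(range(call_start, call_end + 1), key=lambda n: abs(filtered[n]))
    start = center - n_points // 2
    # Assumption: shift the window if it would run past either end of the record.
    start = max(0, min(start, len(original) - n_points))
    window = hamming(n_points)
    segment = original[start:start + n_points]
    return start, [w * y for w, y in zip(window, segment)]


# Toy record: constant signal, with the filtered maximum placed at sample 1000.
y = [1.0] * 2000
y_bp = [0.0] * 2000
y_bp[1000] = 3.0
start, v = extract_sample_vector(y, y_bp, 800, 1200)
print("window starts at sample", start, "and has", len(v), "points")
```

Because a 512-point window is longer than any single RAUT pitch, the segment generally also contains some signal on either side of the strongest pitch, attenuated by the window taper.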

4.2.  Feature  extraction  algorithm 

Feature extraction involves preliminary processing of signals to obtain suitable
parameters that reveal the distinguishing nature of a specific kind of signal. The aim
of feature extraction is to devise a transformation that extracts the signal features
hidden in the original domain. Depending on the characteristics of the signals, a
suitable transformation should be selected to extract the most typical features from
the original signal domain, so as to make the following step of signal analysis, which
in this case is pattern classification, much easier.

Many signals require a more flexible approach, one where we can vary the window size
to determine features of these signals more accurately either in time or in frequency.
Wavelet analysis employs a windowing technique with variable-sized regions. It allows
the use of long time intervals where we want more precise low frequency information
and short regions where we want high frequency information. Wavelet analysis does not
use a time-frequency region, but rather a time-scale domain. It is capable of
revealing aspects of data that other signal analysis techniques miss, such as trends,
breakdown points, discontinuities in higher derivatives, and self-similarity. Unlike
the Wavelet Transform, which decomposes only the low frequency components of a signal,
the Wavelet Packet Transform (WPT) decomposes both the low frequency and high
frequency components. This abundant information with arbitrary time-frequency
resolution allows extraction of features that combine both stationary and
non-stationary characteristics. However, one deficiency that wavelet bases inherently
possess is the lack of translation invariance: a signal with a time shift does not
result in time-shifted wavelet packet coefficients. Direct assessment from all wavelet
packet coefficients often turns out to be tedious or leads to inaccurate results.
Here, we introduce the idea of wavelet packet node energy.17 Each cell $w_{i,j}$
refers to one node in the decomposition tree, where i is the scaling parameter and j
is the oscillation parameter. We call each (i, j) a wavelet packet node. Each node
(i, j) has K coefficients, $w_{i,j,k}$, $k = 1, 2, \ldots, K$.

For a signal of length $2^N$, the maximum value of i is N; for each i, the maximum
value of j is $2^i - 1$ and the maximum value of k is $2^{N-i}$.

The wavelet packet node energy is defined as:

$$e_{i,j} = \sum_{k=1}^{K} w_{i,j,k}^{2} \qquad (4.1)$$

which measures the signal energy contained in a specific frequency band indexed by
parameters i and j. In our case, each wavelet packet node energy value was defined as
an individual feature component and was used as a robust rudimentary exploration of
the specific signal features. For an r-level decomposition, we get a total of
$2^1 + 2^2 + \cdots + 2^r = 2^{r+1} - 2$ node energy values. These are the final
features extracted from each frog call signal by the wavelet method.
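
The node energy of Eq. (4.1) can be illustrated with a small wavelet packet decomposition. The sketch below uses the Haar wavelet purely as an assumption, since the text does not name the wavelet used, and it indexes nodes in natural (filter-bank) order rather than frequency order.

```python
import math


def haar_split(coeffs):
    """One orthonormal Haar analysis step: (low-pass half, high-pass half)."""
    scale = 1.0 / math.sqrt(2.0)
    pairs = range(len(coeffs) // 2)
    low = [(coeffs[2 * m] + coeffs[2 * m + 1]) * scale for m in pairs]
    high = [(coeffs[2 * m] - coeffs[2 * m + 1]) * scale for m in pairs]
    return low, high


def wavelet_packet_node_energies(samples, levels):
    """Return {(i, j): energy} for every node at levels i = 1 .. `levels`."""
    if len(samples) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    energies = {}
    current = [list(samples)]
    for i in range(1, levels + 1):
        children = []
        for node in current:
            # Both the low and the high branch are split again at the next level.
            children.extend(haar_split(node))
        for j, coeffs in enumerate(children):
            energies[(i, j)] = sum(c * c for c in coeffs)  # Eq. (4.1)
        current = children
    return energies


toy_signal = [float((n * n) % 11) for n in range(64)]
features = wavelet_packet_node_energies(toy_signal, levels=3)
print(len(features), "node energies for a 3-level decomposition")
```

Because each node energy sums squared coefficients over the whole node, it is far less sensitive to small time shifts than the individual coefficients are, which is the motivation given above for using it as a feature.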


4.3. Dimension reduction/feature selection algorithm

Notice that using the WPT may produce a high dimensional feature vector for each
individual frog call signal. Direct manipulation of the whole data set is not feasible
because of the high dimensionality of the data and the existence of undesired
components that make the classification unnecessarily complicated. Thus, it is
desirable to use lower dimensional feature vectors as input for the pattern
classifier. To reduce the dimension of feature vectors, one idea is to find a linear
transformation that maps high dimensional data onto a lower dimensional space; the
other is to select those feature components that contain the most discriminant
information and discard those that provide little information useful for
classification. Here, the second method is chosen for dimension reduction.
Specifically, the feature components $\{f_k \mid k = 1, 2, \ldots, n\}$ are ranked by

$$J(f_1) \geq J(f_2) \geq \cdots \geq J(f_d) \geq \cdots \geq J(f_n) \qquad (4.2)$$

where $J(\cdot)$ is a criterion function for measuring the discriminant power of a
specific feature component, $f_k$. Feature subsets can be chosen from those features
having larger criterion function values.

In this study, a simple yet efficient criterion function known as Fisher's
criterion17 was adopted. For a two-class problem, it is given by

$$J_{f_k}(i,j) = \frac{\left|\mu_{i,f_k} - \mu_{j,f_k}\right|^{2}}{\sigma_{i,f_k}^{2} + \sigma_{j,f_k}^{2}} \qquad (4.3)$$

where $\mu_{i,f_k}$ and $\mu_{j,f_k}$ are the mean values of the kth feature, $f_k$,
for classes i and j, and $\sigma_{i,f_k}^{2}$ and $\sigma_{j,f_k}^{2}$ are the
variances of the kth feature, $f_k$, for classes i and j, respectively. For the
multi-class case (with L classes), the general approach is to take the summation over
the pairwise combinations of $J_{f_k}(i,j)$,

$$J_{f_k} = \sum_{i=1}^{L-1} \sum_{j=i+1}^{L} J_{f_k}(i,j) \qquad (4.4)$$


as an estimation of the discriminant power of the specific feature $f_k$. Equation
(4.4) provides a measure to evaluate the effectiveness of a "global" feature that is
simultaneously suitable to differentiate all classes of signals. When the number of
classes is small, this approach may be sufficient. When the number of classes
increases, this measure becomes more ambiguous. A large value of $J_{f_k}$ may be due
to the accumulation of many relatively small values (an unfavorable case) or to a few
significant terms with a negligible majority (a favorable case). Also, a feature with
a large $J_{f_k}(i,j)$ value for classes i and j may have very small discrimination
power for the other classes, which makes $J_{f_k}$ also very small. To avoid these
problems, two methods are proposed as possible alternatives.

Method I. Instead of trying to select features that are effective for the entire
multi-class problem globally as measured by Eq. (4.4), a feature subset based on
Eq. (4.3) was selected for each possible pair of classes. Then the union of the
feature components selected from each pair of classes was taken to form the final
feature vector. Specifically, given an L-class problem with n feature components, the
process is detailed in the following steps:

(1) For each possible class pair
$\{(i,j) \mid i = 1, 2, \ldots, L-1;\ j = i+1, i+2, \ldots, L\}$, calculate the
discriminant power measure $J_{f_k}(i,j)$ for each feature component, $f_k$, using
Eq. (4.3).

(2) For each class pair, sort $J_{f_k}(i,j)$ such that:

$$J_{f_1}(i,j) \geq J_{f_2}(i,j) \geq \cdots \geq J_{f_d}(i,j) \geq \cdots \geq J_{f_n}(i,j) \qquad (4.5)$$

Determine the feature subset $F_{ij}$ for each class pair by selecting the d feature
components that have the maximum $J_{f_k}(i,j)$ values:

$$F_{ij} = \{f_k \mid k = 1, 2, \ldots, d\}, \quad i = 1, 2, \ldots, L-1; \quad j = i+1, i+2, \ldots, L \qquad (4.6)$$

(3) Form the final feature set by taking the union of each feature subset:

$$F_{final} = \bigcup_{i=1}^{L-1} \bigcup_{j=i+1}^{L} F_{ij} \qquad (4.7)$$

Method II. This method is based on an idea similar to that of Method I. Compared to
Method I, Method II may choose a different number of feature components from each
class pair, so a more reasonable set of feature components is expected with this
method. The first step is the same as that of Method I. In the second step, after the
$J_{f_k}(i,j)$ were sorted in descending order, the whole set was normalized. That is:

$$J = \sum_{k=1}^{n} J_{f_k}(i,j) \qquad (4.8)$$

$$J'_{f_k}(i,j) = \frac{J_{f_k}(i,j)}{J}, \quad k = 1, 2, \ldots, n \qquad (4.9)$$


Set one threshold value $H \in (0, 1]$. Determine the feature subset $F_{ij}$ for each
class pair by selecting the D feature components that have the maximum
$J'_{f_k}(i,j)$ values. D must satisfy:

$$\sum_{k=1}^{D-1} J'_{f_k}(i,j) < H \leq \sum_{k=1}^{D} J'_{f_k}(i,j) \qquad (4.10)$$

$$F_{ij} = \{f_k \mid k = 1, 2, \ldots, D\}, \quad i = 1, 2, \ldots, L-1; \quad j = i+1, i+2, \ldots, L \qquad (4.11)$$


The third step is the same as that of Method I. By using Method II, different numbers
of feature components may be extracted from each class pair. If two classes are well
separable, there must exist a few feature components that have much larger
$J_{f_k}(i,j)$ values than most other feature components. If two classes are not well
separable, then most feature components tend to have similar, relatively small
$J_{f_k}(i,j)$ values. By setting this threshold, we may choose fewer feature
components from well separable class pairs and more feature components from those
class pairs that are difficult to separate. Thus, we may get feature subsets with
relatively low dimension that still contain high discriminant power.
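
The per-pair selection of Method II can be sketched as follows. The sketch assumes that D is the smallest number of top-ranked features whose normalized scores accumulate to at least H, which is one reading of the condition on D; the function names and the toy scores are hypothetical.

```python
def select_for_pair(pair_scores, h):
    """Return the indices of the top-ranked features whose normalized scores reach h."""
    if not 0.0 < h <= 1.0:
        raise ValueError("h must lie in (0, 1]")
    total = sum(pair_scores)
    ranked = sorted(range(len(pair_scores)), key=lambda k: pair_scores[k], reverse=True)
    chosen, cumulative = [], 0.0
    for k in ranked:
        chosen.append(k)
        cumulative += pair_scores[k] / total  # normalized score of feature k
        if cumulative >= h:
            break
    return chosen


def select_method_two(scores_by_pair, h):
    """Union of the per-pair subsets, as in the third step of Method I."""
    selected = set()
    for pair_scores in scores_by_pair.values():
        selected.update(select_for_pair(pair_scores, h))
    return sorted(selected)


# Toy Fisher scores: one well separated pair and one poorly separated pair.
toy_scores = {
    ("A", "B"): [90.0, 5.0, 3.0, 1.0, 1.0],
    ("A", "C"): [1.0, 3.0, 2.5, 2.5, 1.0],
}
print(select_for_pair(toy_scores[("A", "B")], 0.7))
print(select_for_pair(toy_scores[("A", "C")], 0.7))
print(select_method_two(toy_scores, 0.7))
```

In the toy data the well separated pair contributes a single dominant feature, while the poorly separated pair needs three features to reach the same threshold, which is the behaviour described above.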


4.4.  Pattern  classification  algorithm 

Once suitable feature components have been extracted from the original feature set, it
is necessary to identify individual frogs based upon these features. Neural networks
and a variety of multivariate statistical methods have been used for pattern
classification problems. Neural network classifiers are widely used because they are
universal function approximators and because their nonlinear nature gives them the
ability to capture the underlying nonlinearity in the incoming data. However, for a
Multi-layer Perceptron (MLP) network, existing patterns must be used to train the
network, and the classifier can only detect already existing classes. That is, in our
case, a priori knowledge of each individual RAUT in an area must be given. If a new
RAUT frog makes a call and its call features are fed into the classifier, the MLP
classifier may not identify it as a new one and will possibly classify it into an
already existing RAUT frog class. As a result, the classifier must have a built-in
on-line learning ability so that it can learn new patterns in real-time and grow
continuously with the number of identified individuals without re-training. The MLP
clearly cannot meet this specification. The Incremental Learning Fuzzy Neural Network
(ILFN) developed in Refs. 18 and 19 can address this deficiency. It uses an
incremental learning algorithm that can detect new classes of patterns and update its
parameters while in operating mode. Its on-line (real-time) learning algorithm is fast
and requires no a priori information.

5. Test Results

In this section, natural frog calls collected via a Telinga Pro V Mono Parabolic
microphone mounted at various lakesides in Stillwater, Oklahoma were used to validate
the feasibility of the proposed species and individual identification methods.

5.1.  Results  for  species  identification 

For species identification, one DAT with a total length of 50 minutes was chosen as
the sample. It contains frog calls of all four species obtained from several lakesides
nearby. Each species is represented by several different individuals. The entire DAT
data were manually saved onto the PC in WAVE format. Each file segment was
approximately 10 seconds long. Each data set was fed into four different programs,
each of which identifies one species. The goal is that each program identifies all
clear calls of its species and does not miscount other signals (e.g. calls made by
other species). It is much more desirable for our system to fail to recognize a call
(a false negative) than to incorrectly indicate that the call of a particular species
is present (a false positive). It is therefore crucial to choose parameters such that
false positives are minimal. For example, to identify RAUT and PSCL frogs, we narrow
down the ranges of pitch duration derived from the call properties summarized in
Sec. 3, so as to avoid mismatches with other short duration spikes. Also, in the
clustering algorithm, we threshold the squared signal to select possible call signals
and discard false impulses. If the threshold value is too large, more irrelevant
signals will be discarded, but some portion of the true frog call signals (those
pitches at the beginning or the end of a call) will also be thrown away because of
their low energy. There is thus a trade-off between accepting more false negatives and
obtaining fewer false positives. In practice, by carefully adjusting the various
parameter values, the results for species identification are quite promising. Within
this sample period, there are hundreds of calls belonging to four different species.
Except for some weak calls and some calls obscured by environmental noise, most clear
calls can be detected and identified as belonging to the correct class with nearly
100% accuracy. For the BUAM and PSST species, the results are found to be perfect.

Yet, there are a few mismatches when identifying the RAUT and PSCL species. Noise
causes part of the problem. In most natural situations, background noise is extremely
high and its temporal and spectral structure is complex and variable. In this DAT
tape, three types of noise are always present:

(1) Noises made by other living creatures: including calls made by other frog species
and some insects like crickets. Occasionally there are dog barks and human speech if
the pond is close to a human community.

(2)  Noises  made  by  natural  phenomena:  including  wind  noise  and  rain  noise. 

(3)  Noises  made  by  vehicles:  including  noises  made  by  automobiles. 

Among these three types of noise, types 1 and 2 occur more frequently. If the
frequency band of these noises is different from that of the specific frog species,
they can be removed in the filtering stage. If the spectrogram of these noises does
not show a steady pulse repetition mode like those of RAUT and PSCL, they can also be
eliminated in the clustering stage. But if the main frequency band and pulse shape of
a noise are quite similar to those of a frog species, a mismatch is inevitable with
the proposed identification method. The third type of noise occurs only occasionally,
but if its level is high, frog calls may be masked and mismatches may also occur.

5.2.  Results  for  individual  identification 

5.2.1.  Data  segmentation 

Individual identification is based on the results of species identification. Once a call of the RAUT species has been identified, its calling period has also been determined. A non-overlapping 512-point Hamming window is then used to extract a 512-point time series data segment from the RAUT call as its sample vector. The window length guarantees that the sample vector contains at least one pitch (the strongest one). For the same species, it is reasonable to assume that the call patterns of different individuals can be fully explored by analyzing a single pitch.

Before this data segment is used for further analysis, its mean value is calculated and subtracted from every sample in the segment. This is because signals with a non-zero mean may produce incorrect spectrum estimates, especially in the low frequency band. Subtracting the mean from the signal often leads to a better estimate at neighboring frequencies.
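
The segmentation step can be sketched as follows. This is a minimal illustration, assuming the Hamming taper is applied first and the mean of the windowed segment is removed afterwards, which is the order in which the text describes the two operations. The function names and the synthetic test signal are hypothetical.

```python
import math

def hamming(n_points):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def extract_sample_vector(call, start, length=512):
    """Window one segment of a call, then subtract the segment mean."""
    segment = call[start:start + length]
    window = hamming(len(segment))
    windowed = [v * w for v, w in zip(segment, window)]
    mean = sum(windowed) / len(windowed)
    return [v - mean for v in windowed]

# Synthetic stand-in for a recorded call: a sinusoid riding on a DC offset.
call = [0.5 + math.sin(2.0 * math.pi * 50 * n / 512) for n in range(1024)]
sample_vector = extract_sample_vector(call, start=0)
residual_mean = sum(sample_vector) / len(sample_vector)
print(len(sample_vector))  # 512
```

After the subtraction the residual mean is zero up to rounding error, so the DC offset no longer leaks into the low frequency bins of the spectrum estimate.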

5.2.2.  Generation  of  training /testing  data  set 

Because frogs are sensitive to sudden changes in their environment, their calls are difficult to collect. In addition, human experts usually have limited ability to tell different individuals apart. For these reasons, only 66 data sets have been identified in total, corresponding to four different individual RAUT frogs. Of these 66 data sets, 44 were randomly chosen as training data each time, while the remaining 22 were used for testing. The distribution of the training and testing data sets is shown in Table 2.

5.2.3.  System  description 

After the 66 512-point sample vectors are extracted, the Wavelet Packet Transform (WPT) is used for feature extraction. In addition, Linear Predictive Coding (LPC) and the Time-Dependent Fourier Transform (TDFT) are used for comparison. For TDFT and WPT, two different dimension reduction algorithms are used to derive the final feature vector, which is then fed into a neural network classifier. For fairness of the comparison, an MLP is used here as the neural classifier. It was found in Ref. 18 that ILFN will outperform the MLP in all classification tasks. The steps summarized below provide the details of each feature extraction algorithm considered.

Table 2. Number of individual samples and the distribution of training/testing sets.

RAUT Individual    Total Number of    # of Data Sets    # of Data Sets
                   Data Sets          for Training      for Testing
Frog 1             16                 11                5
Frog 2             20                 13                7
Frog 3             17                 11                6
Frog 4             13                 9                 4

LPC: 

(1) Determine the number of LPC coefficients, p, according to the mean square error (MSE) between the filtered values and the actual values. It was found at p ~ 12 that the MSE is acceptable.

(2) For each sample vector, determine 16 time domain LPC filter coefficients.

(3) Calculate the FFT of the 16-point LPC coefficients and obtain nine unique spectral magnitudes.

(4) Normalize these nine spectral magnitudes to derive the final feature vector.

TDFT:

(1)  Calculate  512-point  FFT  for  each  windowed  data  set  and  obtain  257-point 
spectral  magnitude  vector. 

(2)  Use  feature  reduction  Methods  I  and  II,  set  parameters  d  (defined  in  Method  I) 
and  H  (defined  in  Method  II)  and  derive  the  corresponding  feature  subsets. 

(3)  Normalize  these  two  feature  subsets  and  acquire  the  final  feature  vector. 

WPT: 

(1)  Perform  eight-level  wavelet  packet  decomposition  for  each  512-point  data  set 
by  using  Daubechies  eight-point  wavelet  function. 

(2)  Calculate  wavelet  packet  node  energy  according  to  Eq.  (4.1)  and  obtain  the 
510-point  feature  vector. 

(3)  Use  feature  reduction  Methods  I  and  II  to  derive  the  corresponding  feature 
subsets. 

(4)  Normalize  these  two  feature  subsets  and  obtain  the  final  feature  vector. 
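
The feature vector lengths quoted in the steps above (9 from a 16-point FFT, 257 from a 512-point FFT, and 510 from an eight-level wavelet packet tree) can be checked with a short sketch. Two assumptions are made here for brevity: the Haar filter pair stands in for the Daubechies wavelet used in the paper, and node energy is taken as the sum of squared coefficients, since Eq. (4.1) is not reproduced in this section. All function names are hypothetical.

```python
import cmath
import math

def unique_spectral_magnitudes(x):
    """Magnitudes of DFT bins 0..N/2, the unique part for a real input."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def haar_split(x):
    """One orthonormal Haar step: approximation and detail halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail

def wavelet_packet_energies(x, levels):
    """Energy of every node at levels 1..levels of the full packet tree."""
    energies = []
    nodes = [list(x)]
    for _ in range(levels):
        children = []
        for node in nodes:
            children.extend(haar_split(node))  # both halves are split again
        energies.extend(sum(c * c for c in child) for child in children)
        nodes = children
    return energies

signal = [math.sin(2.0 * math.pi * 40 * n / 512) for n in range(512)]
lpc_like = unique_spectral_magnitudes([1.0] * 16)
tdft_features = unique_spectral_magnitudes(signal)
wpt_features = wavelet_packet_energies(signal, levels=8)
print(len(lpc_like), len(tdft_features), len(wpt_features))  # 9 257 510
```

The count 510 is the number of nodes in levels 1 through 8 of the tree, 2 + 4 + ... + 256. Feature reduction (Methods I and II) and normalization would follow and are not shown.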

5.2.4.  Test  results 

After the training data were obtained by the three feature extraction algorithms (LPC, TDFT and WPT) and the two feature reduction algorithms (Methods I and II), they were fed into an MLP neural classifier. By checking the classifier outputs for the testing data sets, some conclusions may be drawn as to which method performs better. For dimension reduction Methods I and II, the parameters d and H should be carefully chosen so that reduced feature vectors of the same or similar dimension are generated. There are a total of 66 data sets available. In each test, 44 of them were randomly chosen as training sets while the remaining 22 were used as testing sets. This process is repeated 1000 times. The numbers of training sets and testing sets within one class (calls produced by the same frog) are fixed, as seen in Table 2. The mean accuracy over these 1000 simulations was calculated as the indicator of test performance. The variance is also computed. If the variances are all low and at a similar level, the performance of the whole system can be regarded as stable. The network architectures are N-N-4 (with N neurons in the only hidden layer) and N-10-10-4 (with 10 neurons in each of the first and second hidden layers), where N is the dimension of the final feature vector. In the learning phase, the network is trained until the mean square error falls below 0.001 or the maximum number of epochs (set to 1000) is reached. The resilient backpropagation algorithm (Ref. 20) is used to train the network. In training, the desired output of the MLP is set to a perfect decision, i.e. one 1 and three 0s. But the classifier will not produce such a perfect decision in the testing process; usually each output will be between 0 and 1. Here, we take the output with the maximum value to indicate the most likely individual frog. In all cases, a clear winner can be identified. The classification results are shown in Tables 3-5. Mean refers to the mean accuracy, which ranges from 0 to 1. Var. refers to the variance over the 1000 runs. The training accuracy is always 100% in all cases and the corresponding variance is 0, so these tables only show the test results for the different methods.
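
The evaluation protocol described above (a random split with fixed per-frog counts, a winner-take-all decision, and a mean and variance over repeated runs) can be sketched as follows. The per-frog counts are taken from Table 2; the function names, the random seed, the example network outputs and the placeholder accuracies are hypothetical and are not results from the paper. The population variance is used here as an assumption, since the text does not say which variance estimator was applied.

```python
import random
import statistics

# Per-frog totals and training counts copied from Table 2.
TOTAL = {1: 16, 2: 20, 3: 17, 4: 13}
TRAIN = {1: 11, 2: 13, 3: 11, 4: 9}

def stratified_split(labels, train_counts, rng):
    """Randomly split sample indices, keeping a fixed training count per frog."""
    train, test = [], []
    for frog, n_train in train_counts.items():
        indices = [i for i, label in enumerate(labels) if label == frog]
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test

def winner(outputs):
    """Index of the largest network output (winner-take-all decision)."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

def accuracy(predicted, actual):
    """Fraction of test samples assigned to the correct frog."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

labels = [frog for frog, count in TOTAL.items() for _ in range(count)]
rng = random.Random(0)
train_idx, test_idx = stratified_split(labels, TRAIN, rng)
print(len(train_idx), len(test_idx))       # 44 22
print(winner([0.08, 0.71, 0.15, 0.02]))    # 1 (the second output unit)

# Placeholder per-run accuracies, only to show how the summary is formed.
run_accuracies = [0.68, 0.73, 0.64]
summary = (statistics.mean(run_accuracies), statistics.pvariance(run_accuracies))
```

In the paper the split, training and testing are repeated 1000 times, and the mean and variance of the test accuracy are reported in Tables 3-5.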

First, examine the results of the LPC method. The classifier correctly classifies only about half of the test samples, which is much lower than the results of the TDFT and WPT methods. These samples contain a large amount of noise: a rough estimate for some data files shows an average SNR (signal-to-noise ratio) of -3 dB. The noise significantly degrades the performance of the LPC filter and ultimately leads to the poor performance of the neural network classifier.

On average, the neural network classifiers based on the WPT method achieve a classification accuracy 8% higher than those based on the TDFT method. Wavelet based methods provide better time-frequency resolution and are more efficient than Fourier based methods for non-stationary signal analysis.

Table 3. Test results for LPC method.

LPC                   N     Statistic    N-N-4
                      9     Mean         0.5068
                            Var.         0.0X04

Table 4. Test results for TDFT method.

TDFT                  N     Statistic    N-N-4     N-10-10-4
Method I, d = 4       17    Mean         0.6082    0.6089
                            Var.         0.0075    0.0099
Method II, H = 0.1    18    Mean         0.6218    0.6188
                            Var.         0.0115    0.0089
Method I, d = 8       33    Mean         0.6471    0.6505
                            Var.         0.0093    0.0098
Method II, H = 0.16   33    Mean         0.6330    0.6377
                            Var.         0.0094    0.0093

Table 5. Test results for WPT method.

WPT                   N     Statistic    N-N-4     N-10-10-4
Method I, d = 5       19    Mean         0.6827    0.6809
                            Var.         0.0097    0.0105
Method II, H = 0.06   18    Mean         0.7118    0.7164
                            Var.         0.0068    0.0100
Method I, d = 8       31    Mean         0.6609    0.6891
                            Var.         0.0095    0.0102
Method II, H = 0.1    33    Mean         0.7218    0.6955
                            Var.         0.0076    0.0150

To compare the performance of feature reduction Methods I and II, some conditions have been set. Basically, when the dimensions of the final feature vectors are the same or quite similar, Method II may extract feature components with more discriminant power and thus improve the performance of the neural network classifier. This is especially true in the WPT case, in which the accuracy of the classifier using Method II is on average 3% higher than that using Method I. This also substantiates our assumption that we may choose fewer feature components to distinguish easily separable classes and more feature components to distinguish classes that are less easily separable. In this way, more of the feature components that contain the most discriminant power can be included within a limited feature vector dimension. It is observed that the WPT method imposes a much larger computational load than the TDFT method. If the WPT method is to be used, a low dimensional feature vector and a simple neural network structure are preferred. Among all these combinations, one good solution can be found: use WPT as the feature extraction algorithm with feature reduction Method II (H = 0.06) to obtain 18-point feature vectors, then use an 18-18-4 MLP as the neural classifier. A considerable amount of computation is thus avoided, while the classification accuracy remains high.
6. Conclusion

This research has investigated the feasibility of building an automatic frog call monitoring system based on in-field acquisition of sound signals. Frog species identification is realized in the first stage, where different algorithms, including filtering and grouping, are developed to identify different species. Individual frog identification for the species RAUT is performed in the second stage. Since most research in the field of animal sound recognition focuses on species identification, the individual identification approach proposed herein is novel. A feature extraction algorithm using WPT with two dimensionality reduction algorithms (Methods I and II) and a neural network classifier have been integrated to facilitate the estimation of the population within the species of interest.
Abstract

Gas-liquid two-phase flows are widely used in the chemical industry. Accurate measurement of flow parameters, such as flow regimes, is key to operating efficiency. Because of the complexity of the interface in a two-phase flow, it is very difficult to monitor and distinguish flow regimes on-line and in real time. In this paper we propose a cost-effective and computationally efficient acoustic emission (AE) detection system combined with artificial neural network technology to recognize four major patterns in an air-water vertical two-phase flow column. Several crucial AE parameters are explored and validated, and we found that the densities of acoustic emission events and ring-down counts are two excellent indicators for the flow pattern recognition problem. Instead of the traditional Fair map, a hit-count map is developed, and a multi-layer perceptron neural network is designed as a decision-maker to describe an approximate transmission stage of a given two-phase flow system.

Keywords:  Acoustic  emission,  process  monitoring,  non-destructive  testing,  artificial  neural 
network 
with AE methods was applied to detect and classify four major regimes, namely Bubbly, Slug, Churn and Annular, of a vertical air-water two-phase flow.

The remainder of this paper is organized as follows. Section 2 investigates the four significant flow regimes and introduces the well-regarded Fair regime map to describe the traditional regime classification technique. Section 3 discusses the principle of acoustic emission techniques and defines some important parameters for the experiment. Section 4 proposes our AE based air-water two-phase flow regime classification system: the system hardware configuration is illustrated, and a multilayer perceptron (MLP) neural network is designed to locate the detected signal at the correct position on the AE Hit-Count map. Section 5 presents the experimental results obtained with the designed system. Section 6 provides some concluding remarks along with pertinent observations.

2.  FLOW  REGIMES  OF  A  GAS-LIQUID  TWO-PHASE  COLUMN 

The description of a two-phase flow in pipes is highly intricate because the interface between the two phases can take many forms. For gas-liquid two-phase flows, the variety of interface forms depends on the flow rates, the phase properties of the fluids, and the inclination and geometry of the tube. Generally, for vertical gas-liquid two-phase flows, the flow regimes are mainly determined by the phase flow rates. In this case, Bubbly, Slug, Churn and Annular are four significant regimes that are recognized as standard patterns in the chemical industry. The characteristics of these four patterns are shown in Figure 1. Each of the four patterns has a distinct air/water density and flow speed ratio. The flow rates must be calculated to ensure that all the flow patterns can be observed. To obtain all the required flow rates with the equipment, the flow regime map developed by Fair [12] was used. The map,
shown in Figure 2, is a plot of the Martinelli parameter [13], $X_{tt}$, given by

$$X_{tt} = \left(\frac{1-x}{x}\right)^{0.9} \left(\frac{\rho_g}{\rho_l}\right)^{0.5} \left(\frac{\mu_l}{\mu_g}\right)^{0.1} \qquad (1)$$

versus the total mass velocity, $G_T$, defined as

$$G_T = \frac{m_g + m_l}{S}, \qquad (2)$$

where the quantity $x$ is defined as the mass fraction of the gas phase in the two-phase mixture and is given by

$$x = \frac{m_g}{m_g + m_l}. \qquad (3)$$

Other parameters involved are defined as

$m_g$ - gas flow rate, in lb/s;
$m_l$ - liquid flow rate, in lb/s;
$\rho_g$ - gas density, in lb/ft^3;
$\rho_l$ - liquid density, in lb/ft^3;
$\mu_g$ - gas viscosity, in lb/ft-s;
$\mu_l$ - liquid viscosity, in lb/ft-s; and
$S$ - area of cross section of the pipe, in ft^2.


In our experiment, the phase flow rates are measured with flow meters. The pipe radius is known, and the density and viscosity values of air and water are taken from standard look-up tables. Therefore, we can map the measured data to the corresponding position on the Fair map, which provides a reference for the flow regimes identified by the AE classification system.
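
As a minimal sketch, the two Fair-map coordinates can be computed from the measured flow rates as shown below, assuming the standard form of the Martinelli parameter, $X_{tt} = ((1-x)/x)^{0.9} (\rho_g/\rho_l)^{0.5} (\mu_l/\mu_g)^{0.1}$. The function names are hypothetical, and the property values and flow rates are rough illustrative numbers for air and water at room conditions, not the values used in the experiment.

```python
def gas_mass_fraction(m_g, m_l):
    """Mass fraction x of the gas phase, Eq. (3)."""
    return m_g / (m_g + m_l)

def martinelli_parameter(x, rho_g, rho_l, mu_g, mu_l):
    """Martinelli parameter X_tt, Eq. (1)."""
    return (((1.0 - x) / x) ** 0.9
            * (rho_g / rho_l) ** 0.5
            * (mu_l / mu_g) ** 0.1)

def total_mass_velocity(m_g, m_l, area):
    """Total mass velocity G_T in lb/(ft^2 s), Eq. (2)."""
    return (m_g + m_l) / area

# Illustrative inputs (assumed, approximate): flow rates in lb/s, area in ft^2.
m_g, m_l, area = 0.01, 0.99, 0.05
rho_g, rho_l = 0.075, 62.4        # lb/ft^3, air and water
mu_g, mu_l = 1.2e-5, 6.7e-4       # lb/ft-s, air and water

x = gas_mass_fraction(m_g, m_l)
x_tt = martinelli_parameter(x, rho_g, rho_l, mu_g, mu_l)
g_t = total_mass_velocity(m_g, m_l, area)
print(x, x_tt, g_t)
```

Each pair of measured flow rates yields one point on the reference map, which is how the recorded steady and transient states can be placed on it.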


Figure  1  Water/air  flow  ratio  of  four  major  two-phase  vertical  flow  patterns 


3.  ACOUSTIC  EMISSION  TECHNIQUE 

Acoustic emission testing is a powerful method for examining the behavior of materials in which a transient elastic wave is generated by a rapid release of energy. During an AE test, the sensors on the test piece produce any number of transient signals. A signal from a single, discrete event is known as a burst-type signal. This type of signal has a fast rise time and a slower decay, as illustrated in Figure 3. Burst-type signals vary widely in shape, size and rate of occurrence depending on the structure and test conditions. Several parameters of an AE signal are defined as follows:

a) AE threshold - A predefined value used to indicate the occurrence of an AE hit and to determine the number of AE counts.

b) AE hit (event) - Occurs when the amplitude of the sensor output signal is higher than the predefined threshold.

c) AE signal duration - The period between the starting and ending points of a hit.

d) AE ring-down count - The number of threshold-crossing pulses.

Figure  4  shows  the  frequency  spectrum  of  an  AE  signal. 


Since the AE signals in our experiment are of relatively short duration (less than 1 msec), reach their maximum amplitude early in the signal (assumed to occur at t = 0) and decay exponentially, as shown in Figure 3, we can model the sensor output as

$$V(t) = V_0 e^{-rt} \sin wt \qquad (4)$$

where:

$V(t)$ - output voltage of sensor;
$V_0$ - initial signal amplitude;
$r$ - decay constant (>0);
$t$ - time; and
$w$ - signal frequency.

Since the threshold voltage $V^*$ has been set, we can count the number of times the sensor voltage exceeds it. This technique is known as ring-down counting. For the signal represented by Equation (4), the number of counts ($N$) to the nearest integer is given by

$$N = \frac{t^*}{2\pi / w} \qquad (5)$$

where $t^*$ is the time at which the signal envelope decays to the threshold,

$$V^* = V_0 e^{-r t^*} \quad \text{and} \quad t^* = \frac{1}{r} \ln\frac{V_0}{V^*}. \qquad (6)$$
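
The ring-down count can be checked numerically with a short sketch: the closed-form estimate (the envelope decay time $t^* = \ln(V_0/V^*)/r$ divided by the signal period $2\pi/w$) is compared with a direct count of upward threshold crossings of the decaying sinusoid of Equation (4). The function names are hypothetical. The amplitude, threshold and decay constant are illustrative assumptions; only the 150 kHz frequency is taken from the sensor resonant frequency listed in the hardware table.

```python
import math

def ring_down_count(v0, v_th, r, w):
    """Closed-form count: envelope decay time t* divided by the period 2*pi/w."""
    t_star = math.log(v0 / v_th) / r
    return round(t_star / (2.0 * math.pi / w))

def simulated_count(v0, v_th, r, w, dt=1e-8, t_end=1e-3):
    """Count upward threshold crossings of V(t) = v0 * exp(-r*t) * sin(w*t)."""
    count = 0
    previous = 0.0
    steps = int(t_end / dt)
    for k in range(1, steps + 1):
        t = k * dt
        current = v0 * math.exp(-r * t) * math.sin(w * t)
        if previous <= v_th < current:  # one count per upward crossing
            count += 1
        previous = current
    return count

# Illustrative values (assumed): 1 V initial amplitude, 0.1 V threshold,
# decay constant 5000 1/s, and a 150 kHz signal.
v0, v_th, r = 1.0, 0.1, 5000.0
w = 2.0 * math.pi * 150e3

analytic = ring_down_count(v0, v_th, r, w)
simulated = simulated_count(v0, v_th, r, w)
print(analytic, simulated)
```

For these values the envelope falls to the threshold after roughly 0.46 ms, so the two counts should agree to within one count.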


For the given air-water two-phase flow classification problem, we obtain per-second average values of the parameters introduced above. Table 1 compares the results for the four major patterns.


Table 1 AE parameter values of four major flow regimes*

Parameter                              Bubbly                 Slug                  Churn                  Annular
Average number of AE hits              5 (0-58)               1                     185 (134-243)          39 (4-92)
occurring in one second
Average number of AE counts            87 (0-32)              749 (17-4923)         8192 (2487-14236)      30521 (13510-44673)
occurring in one second
Average amplitude of the AE hits       61.56 (60.17-61.92)    60.59 (59.81-60.7)    61.43 (59.93-61.65)    61.13 (60.11-61.96)
occurring in one second (dB)
Average rise time of the AE hits       17.7 (1-59)            20.49 (1-429)         26.47 (1-711)          35.37 (1-717)
occurring in one second (µs)
Average duration of the AE hits        75 (1-264)             127.9 (1-8468)        314.6 (1-43012)        430.6 (1-132259)
occurring in one second (µs)

* The values in () are the value range of the given AE parameters in one second


From the data analysis summarized in Table 1, we can see that the number of AE ring-down counts occurring in one second is the most reliable indicator for the given pattern classification problem. To ensure the reliability of the final classification result, we also use the number of AE hits (events) occurring in one second as a second indicator. By mapping each one-second data record to a point on the Hit-Count map, we can classify the four major flow regimes, as will be shown in Section 5. For a more complicated gas-liquid vertical column, all the features discussed above can be integrated into the decision-making process.
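
The mapping from raw AE hits to points on the Hit-Count map can be sketched as below. The record layout (a time stamp in seconds and a ring-down count for each hit), the function name and the sample values are assumptions made for illustration; they do not describe the data format of the acquisition hardware.

```python
from collections import defaultdict

def hit_count_points(hits):
    """Aggregate (time_s, ring_down_counts) hit records into per-second points.

    Returns {second: (number_of_hits, total_ring_down_counts)}.
    """
    per_second = defaultdict(lambda: [0, 0])
    for time_s, counts in hits:
        second = int(time_s)
        per_second[second][0] += 1        # one more hit in this second
        per_second[second][1] += counts   # accumulate ring-down counts
    return {second: tuple(v) for second, v in sorted(per_second.items())}

# Hypothetical hit records: (time in seconds, ring-down counts of the hit).
hits = [(0.10, 12), (0.45, 20), (0.90, 9), (1.20, 150), (1.75, 210)]
points = hit_count_points(hits)
print(points)  # {0: (3, 41), 1: (2, 360)}
```

Each resulting (hit number, count number) pair is one point on the Hit-Count map and one two-element input vector for the classifier.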
4. AE CLASSIFICATION SYSTEM


The hardware configuration of the proposed real-time AE air-water two-phase flow classification system is shown in Figure 5. The system includes data acquisition, signal processing, data analysis and decision-making. In Figure 5, sensor A is an AE sensor attached to the pipe to detect the AE signals that occur in the flow, and sensor B is a piezoelectric AE sensor of the same type, located near sensor A and designated for detecting background noise. The related AE hardware parameters are listed in Table 2. The output data from the AE sensors are amplified and filtered before A/D conversion. In the data analysis and decision-making parts, the input of the MLP neural network is the discrete data from the AEDSP card manufactured by Physical Acoustics Corporation. The network output determines the current position on the Hit-Count map, which serves as the indicator for the decision-making Graphic User Interface (GUI) software.


Figure 5 Configuration of experimental AE detection system (legible block labels: Data Analysis and Decision Making; Raw Data Storage; Long Term Storage of Processing Data; Plotter (optional); Video (optional); Data Output)


Table 2 AE hardware parameter values

AE Sensor Resonant Frequency     150 kHz
Sampling Rate of DSP Board       1 MHz
Gain of Amplifier                40 dB
Time Window of Each AE Hit       1024 points (256 µs)
Threshold Voltage                0.0586 V (35 dB)

5.  EXPERIMENTAL  RESULTS 

In the experiment, we designed a continuous regime-changing process that includes 12 steady states and 11 transient states during a 20-minute data acquisition period. Each steady state holds a pair of typical air and water flow rates (Figure 6) for about 30 seconds and then transfers to another steady state. The state between two steady states is a transient state, which has unstable phase flow rates. The phase flow rates of all the steady and transient states were recorded to generate our reference Fair map (Figure 7). Meanwhile, the AE hit and count numbers for each one-second period were also measured and stored during the entire process.

Figure 6 Air and water flow rates for steady states and transient states


Figure 7 Reference Fair map for the four major regimes


Figure 8 shows the corresponding AE Hit-Count map, in which each marked point represents the total AE hit number and count number for one second. From this Hit-Count map, we can see that the “Annular” regime is clearly separated from the other three regimes. The “Churn” and “Slug” regimes are also well separated, although there is a small overlap between them. The “Bubbly” and “Slug” regimes cannot be linearly separated, since the overlapping area between them is not trivial. However, we can train a nonlinear classifier, such as an MLP neural network, to classify them, because the majority of the points in these two regimes are separable.

In our experiment, we use a neural network with two hidden layers and 10 neurons in each hidden layer. We randomly select 600 data points for training and use the other 600 data points for testing. The Levenberg-Marquardt algorithm with Bayesian regularization is applied, and the learning rate is set to 0.1. The maximum number of training epochs is 1,000, and the stopping mean square error is 1e-5. Figure 9 (a) and (b) show the training result Hit-Count map and the training target Hit-Count map. Figure 10 (a) and (b) show the testing result Hit-Count map and the testing target Hit-Count map.
Figure 8 AE Hit-Count map corresponding to the Fair map in Figure 7 (axes: AE hit number, AE count number)

Figure 9 (a) Neural network training result Hit-Count map and (b) Training target Hit-Count map (axes: AE hit number, AE count number)

Figure 10 (a) Neural network testing result Hit-Count map and (b) Testing target Hit-Count map (axes: AE hit number, AE count number)
Comparing the training and testing result Hit-Count maps with the target maps in Figures 9 and 10, we can see that the overlapping areas between adjacent regimes have disappeared because of the clustering behavior of the neural network. In this case, the designed neural network improves the classification performance by reducing the misclassification rate for the given air-water two-phase flow regime classification problem. The output of the neural network is passed to the decision-making and Graphic User Interface software. This software is designed to have the following functions: 1) indicate the current flow regime; 2) let the user start and stop the data acquisition process; and 3) perform database management for history record inquiry. Figure 11 shows the GUI software. The developed interface allows the operator to visually monitor the process to facilitate expert decision-making. By integrating concepts such as distributed virtual instrumentation, the operator can not only remotely monitor the process but also activate the control law for reconfiguration via Ethernet [14].

6.  CONCLUSION 

The application of an NDT technique (i.e., AE) and neural networks to vertical air-water two-phase flow pattern recognition problems was proposed and discussed. In this study, several AE parameters were extracted from the signals of four major two-phase flow patterns, and the results were discussed. The densities of AE hits (events) and ring-down counts can be combined as a stable and excellent indicator to describe flow patterns accurately. They form the input stream of a multi-layer perceptron neural network. After the network is trained, the system output can indicate the continuous flow stage (including the four major patterns and the transient states) on-line and in real time. This combined AE and NN detection system may be readily transferred to other gas/liquid two-phase flow regime classification problems, such as saturated steam flow, which is widely used in different industrial processes for heat energy transfer, as a power source, for sanitary flushing, etc. Some common flow regimes for saturated steam flows are the uniform density regime, annular regime, slug regime, and asymmetric density regime. While not demonstrated, we fully believe that the proposed acoustic emission monitoring system will work equally well for saturated steam and other similar two-phase flow systems.


Figure  11  GUI  software  of  the  AE  flow  regime  detection  system 
Abstract — In this study, we propose a novel hybrid intelligent system (HIS) which provides a unified integration of numerical and linguistic knowledge representations. The proposed HIS is a hierarchical integration of an incremental learning fuzzy neural network (ILFN) and a linguistic model, i.e., a fuzzy expert system (FES), optimized via the genetic algorithm (GA). The ILFN is a self-organizing network with the capability of fast, one-pass, online, and incremental learning. The linguistic model is constructed based on knowledge embedded in the trained ILFN or provided by the domain expert. The knowledge captured from the low-level ILFN can be mapped to the higher-level linguistic model and vice versa. The GA is applied to optimize the linguistic model to maintain high accuracy, comprehensibility, completeness, compactness, and consistency. The resulting HIS is capable of dealing with low-level numerical computation and higher-level linguistic computation. After the system has been completely constructed, it can incrementally learn new information in both numerical and linguistic forms. To evaluate the system’s performance, the well-known benchmark Wisconsin breast cancer data set was studied as an application to medical diagnosis. The simulation results show that the proposed HIS performs better than the individual standalone systems. The comparison results show that the linguistic rules extracted are competitive with or even superior to some well-known methods.

Index  terms — ILFN,  fuzzy  expert  system,  GA,  hybrid  intelligent  system,  pattern  classification, 
decision  support  system,  medical  diagnosis,  Wisconsin  breast  cancer  database 

I.  INTRODUCTION 

Conventional medical diagnosis in clinical examinations relies heavily upon physicians’ experience. Physicians intuitively exercise knowledge obtained from previous patients’ symptoms. In everyday practice, the amount of medical knowledge grows steadily, such that it may become difficult for physicians to keep up with all the essential information gained. For physicians to diagnose a patient quickly and accurately, there is a critical need to employ computerized technologies to assist in medical diagnosis and in accessing related information. Computer-assisted technology is certainly helpful for inexperienced physicians in making medical decisions as well as for experienced physicians in supporting complex decisions. It has become an attractive tool to help physicians retrieve medical information and make decisions in medical diagnosis [1]-[7].

A  number  of  medical  diagnostic  decision  support  systems  (MDSS)  based  on 
computational  intelligence  methods  have  been  developed  to  assist  physicians  and  medical 
professionals.  Some  medical  diagnosis  systems  based  on  computational  intelligence  methods  use 

*  This  work  was  supported  in  part  by  the  Royal  Thai  Government  and  the  U.S.  Air  Force  Office  of  Scientific 
Research  under  grant  F49620-98-1-0049. 


expert systems (ESs) [8]-[11], fuzzy expert systems (FESs) [12]-[15], artificial neural networks
(ANNs) [16]-[19], and genetic algorithms (GAs) [20]-[22]. ESs and FESs, which use symbolic and
linguistic knowledge, respectively, are well recognized in medical diagnosis applications
because their decisions are easy for physicians and medical professionals to understand.
However, the development of an ES or a FES for medical diagnosis is not a trivial task. It
demands an intensive and iterative process from medical experts who may not be readily
available. ANNs have been employed to learn numerical data recorded from sensory
measurements or images. After being trained, ANNs keep knowledge in numerical weights and
biases that are often regarded as a black-box scheme. It is difficult for physicians
or medical professionals to understand the rationale underlying this knowledge. Recently, the
numerical weights can be translated to symbolic/linguistic rules by using rule extraction
algorithms [23]-[29]. The symbolic/linguistic rules extracted are then used as a knowledge base
for an ES or a FES to support the physicians in making decisions for medical diagnosis [23],
[26], [28]-[30]. The resulting knowledge base is often incomplete and inefficient. It may
perform poorly in unseen

To improve the accuracy of a decision-making system such as an application in medical
diagnosis, the integration of symbolic/linguistic processing and numerical computation has
motivated a research area, namely hybrid intelligent architectures [26], [31]-[34]. Hybrid
intelligent architectures tend to be more appropriate in applications that require both numerical
computation for higher generalization and symbolic/linguistic reasoning for explanation. It has
been found that hybridization between symbolic/linguistic and numerical representations can
achieve higher accuracy compared to either one alone [24], [26], [34].

Most hybrid intelligent systems have focused on accuracy and interpretability. In
a learning system, an incremental learning capability is considered an important attribute aside
from accuracy and interpretability. As for medical diagnosis, patient data grows every day when
new symptoms are discovered. Novel medical knowledge should be quickly incorporated into a
medical diagnosis system without spending tremendous time in the learning process. In a hybrid
system, multilayer perceptron (MLP) neural networks are usually used as a numerical model that
is trained by backpropagation algorithms [26]. One well-known problem of the backpropagation
algorithm is the lack of an incremental learning feature. The standard backpropagation
algorithm lacks the ability to dynamically generate extra nodes or connections during learning.
In fact, in medical problems, new knowledge of symptoms may be found after the diagnosis
system has been constructed. Using the backpropagation algorithms, all old and new data have to
be retrained in order to update the knowledge to cover the new symptoms. With an incremental
learning algorithm, the new knowledge can be learned and added to the system without
retraining previously learned data [35]-[38].

The contribution of this study is to develop a pattern classifier system (i.e., a decision
support system) that is concerned not only with accuracy and interpretability but also with an
incremental learning concept. We propose a hybrid intelligent system (HIS) that is composed of a
numerical model at the low level and a linguistic model at the higher level and is equipped with
an incremental learning algorithm. The proposed system is a hierarchical integration of an
incremental learning fuzzy neural network (ILFN) [37], [38] and a fuzzy expert system (FES)
based on the Takagi-Sugeno (TS) fuzzy model [39]. The ILFN is a self-organizing network with the
capability of fast online learning. The ILFN can learn incrementally without retraining old
information. The linguistic model, FES, is constructed based on knowledge embedded in the
trained ILFN. The knowledge captured from the low-level ILFN can be mapped to the higher-level
FES and vice versa. The system is equipped with a conflict resolution scheme to maintain
consistency in decision-making. The low-level ILFN contributes fast, incremental learning while
the higher-level FES offers advantages of dealing with fuzzy data. It provides easy
interpretation and explanation of the decision made. The GA [40] is applied to optimize the
linguistic model to maintain high accuracy and comprehensibility. The resulting HIS is capable
of dealing with low-level numerical data and higher-level linguistic information. After being
completely constructed, the system can incrementally learn new information in both numerical
and linguistic forms.

The  remainder  of  this  paper  is  organized  as  follows.  Section  II  discusses  the  proposed 
HIS  architecture.  To  demonstrate  the  effectiveness  and  efficiency  of  the  proposed  system, 
numerical  simulations  and  benchmark  comparisons  are  presented  in  Section  III.  Section  IV 
provides some concluding remarks and future research directions.

II.  THE  ARCHITECTURE  OF  THE  PROPOSED  HYBRID  INTELLIGENT  SYSTEM 


Figure  1:  The  Architecture  of  the  Proposed  Hybrid  Intelligent  System 

The proposed HIS, shown in Figure 1, consists of five components: 1) an ILFN, 2) a
FES, 3) a network-to-rule module, 4) a rule-to-network module, and 5) a decision-explanation
module. Input data is brought into the system through both the ILFN and the FES. The ILFN and
the FES are linked together by a network-to-rule module, which is a rule extraction algorithm for
mapping the ILFN to the FES, and a rule-to-network module, which is an algorithm for mapping
the FES to the ILFN. The outputs of the ILFN and the FES connect to the decision-explanation
module, which makes decisions and explanations based on the information received from both
the ILFN and the FES. Human operators can interact with the system through a user
interaction module. The details of each module are discussed in the following sections.

A.  Incremental  Learning  Fuzzy  Neural  Network  (ILFN) 

The ILFN is a self-organizing network equipped with an on-line, incremental learning
algorithm capable of learning all training patterns within only one pass. Gaussian radial basis
functions are used to form the distribution of the pattern space. The ILFN can learn in a
supervised or an unsupervised fashion [37], [38]. The ILFN has four layers: one input layer, one
hidden layer, one output layer, and one decision layer, as shown in Figure 2. Alternatively, the
system can be viewed as two subsystems: an input subsystem and a target subsystem. Each
subsystem has three layers: one input layer, one hidden layer, and one output layer. The hidden
layers of the input subsystem and the target subsystem are linked together via a controller
module, which is used to control the growth of neurons in the hidden layer. The output layer of
each subsystem consists of two modules. The output layer of the input subsystem consists of a
pruning module and a membership module, while the output layer of the target subsystem
consists of a pruning module and a target module. The membership module of the input
subsystem and the target module of the target subsystem are simultaneously updated, with their
number of neurons controlled by the pruning modules. The outputs of the classifier are linked
together via a decision layer.


Figure  2:  The  ILFN  Classifier  Architecture 


The ILFN network uses four weighting parameters, WP, WT, STD, and count, as well as one
threshold parameter, s. WP is the hidden weight of the input subsystem. Each row (i.e., a node in
the hidden layer) of WP represents a mean or centroid of a cluster. Each node of WT stores the
corresponding target of the input prototype patterns. STD and count are the standard deviation
and the number, respectively, of patterns that belong to each node. The threshold parameter s,
selected within the range [0, 1], controls the number of clusters. The system generates many
clusters if s is large and few clusters if it is small. However, clusters that belong to the same class
are grouped together via the pruning module. The details of the learning algorithm can be found in
[37], [38].

Figure 3: ILFN Decision Boundaries of a Three-Class, Two-Dimensional Pattern Space

Figure  3  shows  the  ILFN  that  is  used  to  classify  a  three-class,  two-dimensional  pattern 
space. Three circles are the three clusters of the three classes centered at wP1, wP2, and wP3.
The  membership  values  are  highest  when  the  patterns  are  located  at  the  centers  of  the  clusters. 
The  membership  values  monotonically  decrease  when  the  distances  between  the  patterns  and  the 
centers  of  the  clusters  increase.  The  size  of  the  circle  depends  on  the  variances  of  the  patterns 
that  belong  to  the  clusters.  A  pattern  outside  a  circle  indicates  a  near-zero  membership  degree  of 
belonging  to  the  cluster.  The  dashed  line  indicates  the  boundaries  of  each  cluster. 

A trained ILFN does not exhibit a clear meaning of the knowledge embedded inside its
structure. Linguistic knowledge is preferable if an explanation about the decision is needed.
Thus it is desirable to transform the knowledge of the trained ILFN into a form that is easier to
comprehend, namely the linguistic form used in a FES. A FES has a close relationship with an
ILFN network in that one can be mapped to the other. In order to employ both numerical
calculation from an ILFN and linguistic processing from a FES, we combine both a trained
ILFN and the mapped FES into the same hybrid system. The output decision of the hybrid
system is based on both the ILFN and the FES. The resulting hybrid system provides
complementary features from both the ILFN and the FES. The hybrid system appears able to
deal with more complex problems that need an explanation capability.

The  next  section  describes  the  details  of  the  FES  used  in  this  study,  as  well  as  how  to 
map  knowledge  from  a  trained  ILFN  to  a  FES  and  vice  versa. 

B.  Fuzzy  Expert  System  (FES) 

A FES can be thought of as a special kind of expert system (ES). In fact, a FES is an ES
that incorporates fuzzy sets [41]. Thus, a FES exhibits transparency to users. Users can
easily understand the decision made by a FES because the rule base is in the “if-then”
form used in natural languages. From a knowledge representation viewpoint, a fuzzy if-then rule
is a scheme for capturing knowledge that is imprecise by nature.

Figure 4 illustrates a schematic diagram of a FES. A FES is composed of four main
modules: a fuzzifier, an inference engine, a defuzzifier, and a knowledge base. The function of
the fuzzifier is to determine the degree of membership of a crisp input in a fuzzy set. The fuzzy
knowledge base is used to represent the fuzzy relationships between input and output fuzzy
variables. The output of the fuzzy knowledge base is determined by the degree of membership
specified by the fuzzifier. The inference engine utilizes the information from the knowledge base
as well as from the fuzzifier to infer additional information. The output of a FES can be the fuzzy
values from the inference engine process. Output in fuzzy value format is advantageous in
pattern classification problems since the fuzzy values indicate the degree of belonging of a
given pattern to class prototypes. Optionally, the defuzzifier is used to convert the fuzzy output
of the system into crisp values.


[Figure 4 block labels: Crisp input, Fuzzifier, Knowledge Base, Inference Engine, Defuzzifier, Fuzzy output, Crisp output]

Figure 4: A Fuzzy Expert System (FES)

In a FES, a knowledge base is used for explanation purposes as well as in the decision-making
process. The knowledge structure used in the proposed FES comprises 1) input features’
names,  2)  variables’  ranges,  3)  number  of  linguistic  labels,  4)  linguistic  labels,  5)  membership 
functions,  6)  membership  functions’  parameters,  and  7)  fuzzy  if-then  rules.  The  information 
about  the  knowledge  structure  of  the  FES  can  be  provided  by  experts  or  automatically  generated 
from  data.  For  an  M-dimensional  pattern  space,  the  components  in  the  knowledge  base  used  in 
the  FES  are  detailed  as  follows. 

Knowledge  Base  (K) 

K  =  {  FN,  VR,  NL,  Ling,  MF,  MP,  R  }  (1) 

Input  Features  ’  Names  (FN) 

FN=  (2) 


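Variables’ Ranges (VR)

$$VR = \begin{Bmatrix} \mathit{Vmin}_1, & \mathit{Vmax}_1 \\ \mathit{Vmin}_2, & \mathit{Vmax}_2 \\ \vdots & \vdots \\ \mathit{Vmin}_M, & \mathit{Vmax}_M \end{Bmatrix} \tag{3}$$

Number of Linguistic Labels (NL)

$$NL = \{N_1, N_2, \ldots, N_M\}; \quad N_j \in \{2, \ldots, 9\} \tag{4}$$

To maintain comprehensibility of the linguistic model, the number of the linguistic variables
should be as small as possible. It is suggested that it should not be larger than nine [42].

Linguistic Labels (Ling)

$$Ling = \begin{Bmatrix} \{l_{11}, l_{12}, \ldots, l_{1N_1}\} \\ \{l_{21}, l_{22}, \ldots, l_{2N_2}\} \\ \vdots \\ \{l_{M1}, l_{M2}, \ldots, l_{MN_M}\} \end{Bmatrix} \tag{5}$$

where $l_{jk}$ is a linguistic label in the $j$th dimension and $k$ is the index to it.

Membership Functions (MF)

$$MF = \begin{Bmatrix} \{mf_{11}, mf_{12}, \ldots, mf_{1N_1}\} \\ \{mf_{21}, mf_{22}, \ldots, mf_{2N_2}\} \\ \vdots \\ \{mf_{M1}, mf_{M2}, \ldots, mf_{MN_M}\} \end{Bmatrix} \tag{6}$$

Membership Functions’ Parameters (MP)

$$MP = \begin{Bmatrix} \{mp_{11}, mp_{12}, \ldots, mp_{1N_1}\} \\ \{mp_{21}, mp_{22}, \ldots, mp_{2N_2}\} \\ \vdots \\ \{mp_{M1}, mp_{M2}, \ldots, mp_{MN_M}\} \end{Bmatrix} \tag{7}$$

$$mp_{jk} = \{mp_{jk1}, mp_{jk2}, \ldots, mp_{jkn_{jk}}\}; \quad j = 1, \ldots, M; \; k = 1, \ldots, N_j. \tag{8}$$

Assume that $mf_{jk}$ is a Gaussian membership function. Then $mp_{jk}$ has two variables, $\sigma$ and
$\mu$, the standard deviation and the mean of the Gaussian membership function, respectively. Thus, we
have

$$mp_{jk} = \{mp_{jk1}, mp_{jk2}\} = \{\sigma_{jk}, \mu_{jk}\}; \quad j = 1, \ldots, M; \; k = 1, \ldots, N_j. \tag{9}$$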


Fuzzy If-Then Rules (R)

$$R = \{A_{L \times M}, B_{L \times 1}, CF_{L \times 1}\} = \left\{ \begin{array}{cccc|c|c} A_{11} & A_{12} & \cdots & A_{1M} & B_1 & CF_1 \\ A_{21} & A_{22} & \cdots & A_{2M} & B_2 & CF_2 \\ \vdots & \vdots & & \vdots & \vdots & \vdots \\ A_{L1} & A_{L2} & \cdots & A_{LM} & B_L & CF_L \end{array} \right\} \tag{10}$$

$A_{L \times M}$ represents the antecedent part of the if-then rules; $B_{L \times 1}$ and $CF_{L \times 1}$ constitute the
consequent part of the if-then rules; $L$ and $M$ are the number of fuzzy rules and the
dimension of the pattern space, respectively. $A_{ij}$, $i = 1, \ldots, L$, $j = 1, \ldots, M$, is the antecedent of
the $i$th rule for the $j$th dimension. $A_{ij} \in \{0, 1, \ldots, N_j\}$ is the index of a linguistic label in the $j$th
dimension of the linguistic labels (Ling) of the $i$th rule. If $A_{ij}$ is “0”, then the system uses a don’t
care label whose activation function is always a unity membership grade. $B_i$ is a constant
value that is the class consequent part of the $i$th rule. $CF_i$, a value in [0, 1], is the confidence
factor of the $i$th rule. In a FES, for a finite class pattern classification problem with an
M-dimensional pattern space, linguistic knowledge can be written as a set of fuzzy if-then rules
in a natural language as follows:

$R_i$: IF $x_1$ is $A_{i1}$ AND $x_2$ is $A_{i2}$ AND ... AND $x_M$ is $A_{iM}$,
THEN $x = \{x_1, x_2, \ldots, x_M\}$ belongs to Class $B_i$ with confidence factor = $CF_i$; (11)

where $R_i$, $i = 1, \ldots, L$, is the label of the $i$th rule and $A_{ij}$ indicates a linguistic label such as small,
medium, or large.


Assume that Gaussian membership functions are employed in the FES. The rule firing
strength $\phi_i$ can be computed by the following equation:

$$\phi_i = \min_j \left\{ \exp\left[ -\left( \frac{x_j - \mu_{ij}}{\sigma_{ij}} \right)^2 \right] \right\}, \quad \text{for } i = 1, \ldots, L; \; j = 1, \ldots, M, \tag{12}$$

where $\mu_{ij}$ and $\sigma_{ij}$ are the mean and the standard deviation, respectively, of the linguistic label
indexed by $A_{ij}$; and min is a T-norm operator which can be replaced by product.

After computing the firing strength from each rule, the class output, $C_y$, is calculated by
using the inference mechanism as follows:

$$C_y = B_J; \quad J = \arg\max_i (\phi_i \times CF_i). \tag{13}$$
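
The inference described by Eqs. (12) and (13) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the list-based layout of the rule table, and the toy numbers are all assumptions, and the Gaussian grade is written as exp(-((x - mu)/sigma)^2), which is one reading of Eq. (12).

```python
import math

def firing_strength(x, antecedent, mu, sigma):
    """Eq. (12): minimum Gaussian membership grade over the rule's antecedents.

    antecedent[j] is a 1-based label index; 0 is the don't care label,
    which always contributes a unity membership grade.
    mu[j][k-1] and sigma[j][k-1] hold the mean and standard deviation of
    label k in dimension j (hypothetical storage layout).
    """
    grades = []
    for j, label in enumerate(antecedent):
        if label == 0:
            grades.append(1.0)
            continue
        m, s = mu[j][label - 1], sigma[j][label - 1]
        grades.append(math.exp(-((x[j] - m) / s) ** 2))
    return min(grades)

def classify(x, A, B, CF, mu, sigma):
    """Eq. (13): return the class of the rule maximizing firing strength x CF."""
    scores = [firing_strength(x, A[i], mu, sigma) * CF[i] for i in range(len(A))]
    best = max(range(len(A)), key=scores.__getitem__)
    return B[best], scores

# Toy two-feature problem with labels {1: low, 2: high}; all numbers are made up.
mu = [[0.0, 1.0], [0.0, 1.0]]
sigma = [[0.5, 0.5], [0.5, 0.5]]
A = [[1, 2], [2, 2], [0, 1]]   # antecedent label indices, one row per rule
B = [1, 2, 3]                  # class consequents
CF = [1.0, 1.0, 0.5]           # confidence factors
label, scores = classify([0.1, 0.9], A, B, CF, mu, sigma)
```

Replacing `min(grades)` with a product over the grades gives the alternative T-norm mentioned in the text.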


In developing a FES, developers must pay attention to several issues such as accuracy,
comprehensibility, compactness, completeness, and consistency. Accuracy is a quantitative
measure that indicates the performance of a FES in classifying both training and testing data.
Comprehensibility indicates how easily a FES can be understood by human beings. Generally,
comprehensibility of a FES depends on the following aspects: the distinguishability of the shapes
of membership functions, the number of fuzzy if-then rules, and the number of antecedent
conditions of fuzzy if-then rules. Compactness involves the number of fuzzy if-then rules and the
number of antecedents of fuzzy if-then rules. A more compact fuzzy system usually yields
higher comprehensibility. Completeness assures that a FES will provide a non-zero output for
any given input in the input space. Consistency ensures that fuzzy if-then rules do not
conflict with each other or with human senses. Fuzzy if-then rules are inconsistent if they
have very similar antecedents, but different consequents, and they conflict with the expert
knowledge. If there are fuzzy if-then rules that conflict with each other, the rules become
unclear [43]-[46]. The conflicting rules need to be resolved.

For the completeness of the rule structure in a FES, grid partition methods are widely
used for partitioning the input space into grid cells. Fuzzy if-then rules can be obtained by using
fuzzy grid partitions [47]. Despite its advantage of providing the completeness of the rule structure,
the grid partition method has a disadvantage in that the number of fuzzy rules increases
exponentially as the dimension of the input space increases. Since each cell represents a fuzzy
if-then rule, the number of fuzzy if-then rules is usually very large. The system becomes a
black-box scheme that is not comprehensible to human users. Pattern classification problems in
the real world often have large dimensions. It is undesirable to directly use grid-type
partitioning for constructing fuzzy if-then rules.
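
To make the exponential growth concrete, the short Python sketch below counts the cells of a full grid partition, assuming one fuzzy rule per cell as stated above. The function name and the choice of three labels per feature are illustrative assumptions.

```python
from math import prod

def grid_rule_count(labels_per_dim):
    """Number of grid cells (one fuzzy rule per cell) in a full grid partition."""
    return prod(labels_per_dim)

# Three linguistic labels per feature; the feature counts are illustrative.
counts = {m: grid_rule_count([3] * m) for m in (2, 5, 9)}
# counts == {2: 9, 5: 243, 9: 19683}
```

Even with only three labels per feature, a nine-feature problem would already need 19683 rules under a full grid.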

To obtain a smaller number of fuzzy rules, projection from clusters [43], [45] is called
for. Clustering algorithms can be used for partitioning data points into a small number of
clusters. Each cluster then represents a fuzzy relation and corresponds to a rule. The fuzzy sets
in the antecedent parts of the rules are projected from the clusters onto the corresponding axes of
the data space. The number of rules from the projection method is smaller than that from
grid-type partitioning, so a more compact linguistic model is obtained. However, the fuzzy sets
that are directly projected from clustering methods may not be transparent enough. The number
of fuzzy sets from the projection may be very large and redundant, since the projected fuzzy sets
may be very similar, resulting in a fuzzy system that is not optimal. This problem can be
solved using rule simplification methods [45], [48]. Alternatively, fuzzy if-then rules can be
obtained by projecting the trained ILFN parameters.

The ILFN groups the patterns in the input space into a small number of clusters. Based on
grid partition methods, the clusters of the trained ILFN and their parameters can be mapped to
fuzzy if-then rules. The number of fuzzy if-then rules is equal to the number of clusters in the
trained ILFN. The number of fuzzy sets in each dimension depends on the number of grid
partitions chosen. The parameters of the fuzzy sets are projected from the cluster parameters of the
trained ILFN. Figure 5a shows the projection of the ILFN to one-dimensional fuzzy sets. We can see
that the fuzzy sets in Figure 5a have some similarity. The combination of the grid partitioning
method and the projection method is shown in Figure 5b. The parameters of the fuzzy sets in
Figure 5b are projected and adapted from the clusters of the trained ILFN.

Using  a  grid-based  projection  method,  the  fuzzy  if-then  rules  of  the  FES  are  extracted 
from  a  trained  ILFN.  The  hidden  numerical  weights  of  the  ILFN  are  mapped  into  initial  fuzzy  if- 
then  rules.  A  genetic  algorithm  is  then  used  to  select  only  discriminatory  features  resulting  in  a 
more  compact  rule  set  with  highly  transparent  linguistic  terms.  The  following  section  describes 
the  mechanism  used  to  map  the  ILFN  to  the  FES. 


[Figure 5 panel labels: ILFN Cluster Boundaries; FES Grid Partitions]

Figure 5: a) Projection of ILFN to One-Dimensional Fuzzy Sets
b) FES Grid Partition with its Parameters Projected from Trained ILFN

C.  Network-To-Rule  Module 

Since the knowledge embedded in the ILFN is not in linguistic form, the ILFN lacks
an explanation capability. Linguistic rules can be extracted from the ILFN weights by using a
rule extraction algorithm. A meaningful explanation of the reasoning process can then be
generated from the linguistic model, i.e., the FES. The mechanism used for mapping a trained
ILFN to a linguistic knowledge base operates inside the network-to-rule module. The mechanism
is called the ilfn2rule algorithm.


1)  ILFN2RULE  Algorithm 

Using  a  grid-based  projection  method,  the  ilfn2rule  algorithm  is  used  to  map  a  trained 
ILFN  to  fuzzy  if-then  linguistic  rules. 

The  user  specifies  membership  functions’  types  (MF).  Any  type  of  fuzzy  membership 
functions  can  be  used.  In  this  study,  Gaussian  membership  functions  are  used.  The  rule 
extraction  algorithm  is  given  in  four  steps  described  below. 

Step  1:  Retrieve  trained  ILFN  parameters  (Wp,  WT,  count)  as  well  as  the  numbers  of 
linguistic  labels  (NL).  (The  numbers  of  linguistic  labels  are  determined  during 
the  genetic  optimization  process  that  will  be  discussed  later.) 

Step  2:  Calculate  membership  functions’  parameters  (MP)  that  are  a  center  and  a 
standard  deviation  for  each  linguistic  label  in  the  case  that  Gaussian  membership 
function  is  used.  Centers  of  Gaussian  functions  can  be  determined  from  the 
variables’  ranges  (VR)  that  are  minimum  and  maximum  values  of  the  numerical 
weight  Wp  for  the  ILFN  network. 


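$$\mathit{Vmin}_j = \min(w_{P_{1j}}, w_{P_{2j}}, \ldots, w_{P_{Lj}}) = \min_i(w_{P_{ij}}), \tag{14}$$

$$\mathit{Vmax}_j = \max(w_{P_{1j}}, w_{P_{2j}}, \ldots, w_{P_{Lj}}) = \max_i(w_{P_{ij}}), \tag{15}$$

$$\mathit{res}_j = \frac{\mathit{Vmax}_j - \mathit{Vmin}_j}{N_j - 1}, \tag{16}$$

where $i = 1, \ldots, L$; $j = 1, \ldots, M$; $L$ is the number of hidden nodes, i.e.,
prototypes created by the ILFN network; $M$ is the dimension of the pattern space;
$\mathit{res}_j$ represents the numerical resolution between linguistic variables in the $j$th
dimension; $\mathit{Vmax}_j$ and $\mathit{Vmin}_j$ are the maximum and the minimum values of the
weight WP in the $j$th dimension; and $N_j$ is the number of linguistic variables in
the $j$th dimension.

$$\mu_{jk} = \begin{cases} \mathit{Vmin}_j & \text{for } k = 1 \\ \mu_{j(k-1)} + \mathit{res}_j & \text{for } k = 2, \ldots, N_j \end{cases} \tag{17}$$

$$\mu_j = [\mu_{j1}, \mu_{j2}, \ldots, \mu_{jN_j}], \tag{18}$$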


$$\sigma_j = \frac{\mathit{res}_j}{\sqrt{-2 \times \ln \lambda}}, \tag{19}$$

where $\mu_{jk}$, $k = 1, \ldots, N_j$, represents the mean of the $k$th Gaussian membership
function in the $j$th dimension; $\sigma_j$ represents the standard deviation of the Gaussian
membership functions in the $j$th dimension; and $\lambda$, selected in [0, 1], represents
the overlap parameter between membership functions.
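
Step 2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function name and list-of-rows layout for WP are hypothetical, the centers are computed in closed form instead of by the recursion of Eq. (17), and the width formula sigma_j = res_j / sqrt(-2 ln(lambda)) is an assumed reading of Eq. (19).

```python
import math

def label_parameters(WP, NL, lam):
    """Sketch of Step 2: Gaussian centers and widths for each dimension.

    WP  : list of L prototype rows, each holding M hidden-weight values.
    NL  : list of M label counts N_j (each at least 2).
    lam : overlap parameter, strictly between 0 and 1 so the log is finite
          and non-zero.
    """
    M = len(WP[0])
    centers, sigmas = [], []
    for j in range(M):
        column = [row[j] for row in WP]
        vmin, vmax = min(column), max(column)      # Eqs. (14) and (15)
        res = (vmax - vmin) / (NL[j] - 1)          # Eq. (16)
        # Closed form of the recursion in Eq. (17); the list is Eq. (18).
        centers.append([vmin + k * res for k in range(NL[j])])
        # Assumed reading of Eq. (19).
        sigmas.append(res / math.sqrt(-2.0 * math.log(lam)))
    return centers, sigmas

# Made-up prototype weights: three prototypes in a two-dimensional space.
WP = [[0.0, 2.0], [1.0, 4.0], [0.5, 10.0]]
centers, sigmas = label_parameters(WP, NL=[3, 2], lam=0.5)
# centers == [[0.0, 0.5, 1.0], [2.0, 10.0]]
```

The centers are evenly spaced between the smallest and largest prototype value in each dimension, so the first and last labels always sit on the extremes of WP.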

Step 3: Map the numerical weight WP into linguistic label form using the following
equation:

$$A_{ij} = \arg\min_k \left( \mathrm{dist}(w_{P_{ij}}, cen_{jk}) \right), \tag{20}$$

where $A_{ij}$ represents the index of the linguistic label mapped from $w_{P_{ij}}$; $w_{P_{ij}}$,
for $i = 1, \ldots, L$, $j = 1, \ldots, M$, is an element of the hidden weight WP
of the ILFN network; and $cen_{jk}$, $k = 1, \ldots, N_j$, is the center of the $k$th linguistic
label in the $j$th dimension. $M$ is the dimension of the pattern space and $L$ is the
number of prototypes created by the ILFN network.

Step 4: Generate the if-then rule table: use linguistic antecedent parts obtained from WP and
consequent parts from WT. The number of fuzzy if-then rules is equal to the
number of hidden neurons of the trained ILFN. Calculate the confidence factor $CF_i$,
$i = 1, \ldots, L$, for each rule from the count parameter count using the following
equations:

$$count = [cnt_1, cnt_2, \ldots, cnt_L]^T, \tag{21}$$

$$CF_i = \frac{cnt_i}{\sum_{h \in Class(w_{T_i})} cnt_h}, \tag{22}$$

where $cnt_i$, $i = 1, \ldots, L$, is the count parameter of the $i$th rule (i.e., the $i$th prototype)
obtained when a pattern is included into the $i$th prototype; and
$Class(w_{T_i}) = \{\, l \mid w_{T_l} = w_{T_i},\ l = 1, \ldots, L \,\}$.

$$CF = [CF_1, CF_2, \ldots, CF_L]^T \tag{23}$$
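
Steps 3 and 4 can be illustrated with the following Python sketch. It is a hedged example only: the function names are hypothetical, `dist` is taken to be the absolute difference (the text does not define it), ties go to the lower label index, and the targets in WT are assumed to be stored as plain class labels.

```python
def nearest_label(value, centers):
    """Eq. (20): 1-based index of the label whose center is closest to value."""
    return 1 + min(range(len(centers)), key=lambda k: abs(value - centers[k]))

def extract_rules(WP, WT, count, centers):
    """Sketch of Steps 3-4: antecedents from WP, consequents from WT, CF from count."""
    # Step 3: map every prototype coordinate to a linguistic label index.
    A = [[nearest_label(v, centers[j]) for j, v in enumerate(row)] for row in WP]
    # Step 4: consequents are the stored targets, one rule per hidden node.
    B = list(WT)
    # Eq. (22): normalize each count by the total count of its class.
    class_total = {}
    for target, cnt in zip(WT, count):
        class_total[target] = class_total.get(target, 0) + cnt
    CF = [cnt / class_total[target] for target, cnt in zip(WT, count)]
    return A, B, CF

# Made-up trained parameters: three prototypes, two dimensions, two classes.
centers = [[0.0, 0.5, 1.0], [2.0, 10.0]]
WP = [[0.0, 2.0], [1.0, 4.0], [0.5, 10.0]]
WT = [1, 2, 1]
count = [30, 20, 10]
A, B, CF = extract_rules(WP, WT, count, centers)
# A == [[1, 1], [3, 1], [2, 2]], B == [1, 2, 1], CF == [0.75, 1.0, 0.25]
```

With this normalization, the confidence factors of all rules that share a class sum to one, so a prototype that absorbed more training patterns carries more weight within its class.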


[Figure 6 labels: Linguistic Label {1: low, 2: medium, 3: high}; Direct Mapping; Knowledge Base (Antecedent, Consequent)]

Figure 6: Mapping from ILFN to Linguistic Rules


The knowledge base from Figure 6 can be described in a fuzzy linguistic form that is
similar to natural language as follows:

Rule 1: If feature1 is high and feature2 is low and feature3 is medium, then class is 1;

Rule 2: If feature1 is medium and feature2 is low and feature3 is low, then class is 3;

Rule 3: If feature1 is high and feature2 is medium and feature3 is medium, then class is 2;


After linguistic rules are extracted from the ILFN network, they can be used as a rule
base for a fuzzy expert system. A fuzzy expert system is considered a higher-level knowledge
representation since it uses if-then rules similar to natural languages. Using the linguistic form
makes the system transparent, allowing human users to easily comprehend the rationale of how
the decision was made. Explanations and answers can be provided if needed.

In pattern classification problems, the dimension of the pattern space may be very large.
For very large dimensions, it is too cumbersome to use all the features available as a knowledge
base. Even though the knowledge base is described in linguistic form, using all available
features results in a system that is no longer transparent to users. It is possible to select only
the feature subset that provides the most discriminatory power in classifying patterns. The
genetic algorithm is well suited to selecting the important features. We adapt the genetic
algorithm (GA) [40] to search for an optimal number of features used for each rule while
maintaining a high percentage of correct classification. This also reduces the number of rules:
some rules become redundant after many features have been eliminated, and these duplicate
rules can be pruned out.

2)  Genetic  Algorithm  for  Rule  Optimization 

The linguistic rule base extracted from the ILFN is sub-optimal. In order to obtain a
near-optimal rule set, the GA is used to operate on the initial fuzzy rules. An integer chromosome
representation is used instead of a binary chromosome representation to reduce the size of the
chromosome and improve the speed of the evolutionary operations. The fuzzy if-then rules are
encoded into integer chromosomes to be evolved by the GA. After convergence, the best
chromosomes are decoded back into the FES with a compact rule set.

In  order  to  apply  the  genetic  optimization,  the  if-then  rule  base  is  encoded  in  a 
chromosome  representation.  Only  the  antecedent  is  coded  and  operated  on  by  the  evolutionary 
process.  The  original  rule  set  is  used  as  a  reference  rule  set  in  decoding  the  final  population  to 
the  final  linguistic  rule  base. 

3)  Fuzzy  If-Then  Rule  Encoding 

In our procedure, only the antecedents of the if-then rules are used in genetic encoding.
If-then rules are encoded into an integer chromosome. A chromosome is sometimes referred to as
an individual of the population. The elements of each chromosome, called genes, are integer
numbers. Each gene in a chromosome can be decoded to a fuzzy if-then rule. Let Gr be a
chromosome that is a set of genes $g_i$, $i = 1, \ldots, L$, where

$$Gr = \{g_1, g_2, \ldots, g_L\} \tag{24}$$

$$g_i = \sum_{j=1}^{M} 2^{j-1} \times a_{ij} \tag{25}$$

$$a_{ij} = \begin{cases} 0 & \text{if } A_{ij} = 0 \\ 1 & \text{otherwise} \end{cases} \tag{26}$$

where $L$ is the number of fuzzy if-then rules and $M$ is the dimension of the pattern space. The
antecedent $A_{ij}$ is the linguistic label in the $j$th dimension of the $i$th rule. $a_{ij}$ equal to “1” means
that the $j$th dimension of the $i$th rule is being used, and $a_{ij}$ equal to “0” means that the $j$th
dimension of the $i$th rule is not being used, i.e., don’t care.

For example, suppose a FES is used in a three-class, two-dimensional problem space. The fuzzy expert system has two linguistic labels {1: low, 2: high} in each dimension. Suppose that four rules are extracted from a trained ILFN, as follows.

Rule 1: if $x_1$ is low and $x_2$ is high, then class 1;

Rule 2: if $x_1$ is high and $x_2$ is high, then class 2;

Rule 3: if $x_1$ is low and $x_2$ is low, then class 3;

Rule 4: if $x_1$ is high and $x_2$ is low, then class 3.

That is, we have $R = \{A, B\}$ with

$$A = \begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 1 & 1 \\ 2 & 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 3 \end{bmatrix},$$

where $A$ is the antecedent set and $B$ is the consequent set.


Let $A'$ be a set of antecedents in which some features carry a don't care linguistic label. Let $a$ be a binary set of 1's and 0's indicating whether or not an element of $A'$ is used.

Suppose the antecedent set $A$ is reduced to

$$A' = \begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}.$$

That is, we have the binary set

$$a = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}.$$

From the antecedent set $A'$, we have the encoded chromosome

$$G_R = \{ (2^{1-1} \times 1 + 2^{2-1} \times 1),\ (2^{1-1} \times 1 + 2^{2-1} \times 1),\ (2^{1-1} \times 0 + 2^{2-1} \times 0),\ (2^{1-1} \times 0 + 2^{2-1} \times 1) \}$$

$$= \{ (1+2),\ (1+2),\ (0+0),\ (0+2) \}$$

$$= \{ 3, 3, 0, 2 \}.$$
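The encoding of Equations (24)-(26) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `encode_rules` and the list-of-lists layout of the antecedent set are assumptions made here, not part of the original implementation.

```python
def encode_rules(antecedents):
    """Encode each rule's antecedent row as one integer gene, Eqs. (24)-(26).

    `antecedents` is a list of rows; entry 0 means "don't care" and any
    other integer is a linguistic label index (assumed layout).
    """
    chromosome = []
    for row in antecedents:
        gene = 0
        for j, label in enumerate(row):      # j = 0 is the first dimension
            a_ij = 0 if label == 0 else 1    # Eq. (26)
            gene += (2 ** j) * a_ij          # Eq. (25): 2^(j-1) for 1-based j
        chromosome.append(gene)
    return chromosome


# Reduced antecedent set from the worked example in the text.
reduced_A = [[1, 2], [2, 2], [0, 0], [0, 1]]
print(encode_rules(reduced_A))  # [3, 3, 0, 2]
```

Note that a gene records only which dimensions are used, not which label is attached, which is why the original rule set must be kept as a reference for decoding.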

4)  Fuzzy  If-Then  Rule  Decoding 

Integer chromosomes are used in the genetic optimization process. After the solution has converged, the encoded integer chromosomes are decoded back into fuzzy if-then rule bases. Given an integer chromosome, each gene is decomposed into binary format. The decoding process is the inverse of the encoding process described above. For example, suppose that we have a solution chromosome $G_R = \{3, 3, 0, 2\}$. $G_R$ can be decomposed into binary format as follows:

$$G_R = \{ 3, 3, 0, 2 \} = \{ (1+2),\ (1+2),\ (0+0),\ (0+2) \}.$$

So, we have

$$a = \begin{bmatrix} 1 & 1 \\ 1 & 1 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}.$$

Knowing the original antecedent set

$$A = \begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 1 & 1 \\ 2 & 1 \end{bmatrix},$$

we have the reduced antecedent set

$$A' = \begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 0 & 0 \\ 0 & 1 \end{bmatrix},$$

and the decoded rule base is

$$R' = \{ A', B \} = \left\{ \begin{bmatrix} 1 & 2 \\ 2 & 2 \\ 0 & 0 \\ 0 & 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 2 \\ 3 \\ 3 \end{bmatrix} \right\}.$$

Note that if the antecedent part of a rule consists entirely of don't care linguistic labels, that rule can be eliminated.
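A minimal decoding sketch, under the same assumed data layout (rows of label indices, 0 for don't care), is given below. The name `decode_chromosome` is hypothetical; the logic simply inverts Equation (25) bit by bit and masks the reference antecedent set.

```python
def decode_chromosome(chromosome, reference_A, consequents):
    """Decode integer genes back into rules using the reference rule set.

    Each gene is split into its binary digits a_ij; a digit of 0 replaces
    the reference label with 0 (don't care). Rules whose antecedent is all
    don't care are dropped, as noted in the text.
    """
    n_dims = len(reference_A[0])
    rules = []
    for gene, ref_row, cls in zip(chromosome, reference_A, consequents):
        mask = [(gene >> j) & 1 for j in range(n_dims)]   # binary digits a_ij
        antecedent = [a * label for a, label in zip(mask, ref_row)]
        if any(antecedent):                                # skip all-don't-care rules
            rules.append((antecedent, cls))
    return rules


A = [[1, 2], [2, 2], [1, 1], [2, 1]]   # original antecedent set
B = [1, 2, 3, 3]                       # consequent set
for antecedent, cls in decode_chromosome([3, 3, 0, 2], A, B):
    print(antecedent, "-> class", cls)
```

For the example chromosome {3, 3, 0, 2}, the third rule decodes to an all-don't-care antecedent and is therefore removed, leaving three rules.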

5)  Genetic  Selection  for  the  Number  of  Linguistic  Variables 

The number of linguistic variables can vary with the problem at hand; some problems need more linguistic variables than others. Using more linguistic variables results in finer fuzzy partitions and better classification performance. However, to maintain the interpretability of the system, the number of linguistic variables should be as small as possible. Selecting the number of linguistic variables therefore becomes a trade-off between the accuracy and the interpretability of the system, and finding the point that balances the two is not an easy task. To avoid this difficulty, the number of linguistic labels can be selected by the genetic algorithm. The genetic selection of the number of linguistic variables can be carried out simultaneously with the rule optimization.

The chromosome for the genetic optimization of the number of linguistic variables ($G_{NL}$) can be written as

$$G_{NL} = \{ N_1, N_2, \ldots, N_M \} \qquad (27)$$

where $N_j$, $j = 1, \ldots, M$, is the number of linguistic variables for the $j$th dimension. The chromosome for optimizing the number of linguistic variables ($G_{NL}$) can be combined with the chromosome for optimizing the fuzzy if-then rules ($G_R$). The combined chromosome from Equations (24) and (27) can be written as

$$G = \{ G_{NL}, G_R \} = \{ N_1, N_2, \ldots, N_M, g_1, g_2, \ldots, g_L \} \qquad (28)$$

6)  The  Genetic  Algorithm 

When  the  genetic  algorithm  is  implemented,  it  usually  proceeds  in  a  manner  that  involves 
the  following  steps: 

Step  1  Initialization  of  the  population 

Step  2  Fitness  evaluation 

Step  3  Mate  selection 

Step  4  Crossover 

Step  5  Mutation 

Step  6  Check  stopping  criteria;  if  the  solution  meets  the  criteria,  stop  the  algorithm  and 
obtain  the  final  if-then  rules;  otherwise,  repeat  Steps  2-6. 

Initialization of the population: A chromosome has two different groups of genes: the number of linguistic variables and the fuzzy if-then rules. The initial population of chromosomes is generated randomly as integer numbers in both groups. These initial individuals are carried into the next generation through the genetic operations: fitness evaluation, mate selection, crossover, and mutation.

Fitness evaluation: The fitness function is based on the performance of the rules decoded from a chromosome and on the compactness of the rule set. A fuzzy expert system built from the decoded rules is used to evaluate their performance. The fitness of a chromosome $G$ is determined from the following equations:

$$\mathrm{fitness}(G) = W_{PC} \times PC - W_F \times SC - W_{NL} \times NL, \qquad (29)$$

$$PC = \frac{\text{Total Patterns} - \text{Wrongly Classified Patterns}}{\text{Total Patterns}} \times 100, \qquad (30)$$

$$SC = \sum_{i=1}^{L} \sum_{j=1}^{M} a_{ij}, \qquad (31)$$

$$NL = \sum_{j=1}^{M} N_j, \qquad (32)$$

where $W_{PC}$ is the weight of the percent correct classification by a fuzzy expert system; $PC$ is the percent correct classification; $W_F$ is the weight of the number of features used for a rule set; $a_{ij}$ is calculated from (26); $SC$ is the structure complexity of the fuzzy system, i.e., the number of features used for a rule set, which is the number of 1's in $a$; $W_{NL}$ is the weight of the number of linguistic variables used for a rule set; and $NL$ is the sum of the numbers of linguistic variables used. Because fewer linguistic variables, fewer rules, and fewer features are preferred together with higher classification performance, the weight of the percent correctly classified patterns ($W_{PC}$) is usually set relatively larger than the weight of the structure complexity ($W_F$) and the weight of the linguistic variables ($W_{NL}$). $W_F$, $W_{PC}$, and $W_{NL}$ are all positive real numbers predefined by the user.
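As a rough illustration of Equations (29)-(32), the fitness can be computed as below. The function signature is an assumption made for this sketch, and the default weights are the values reported later in the simulation section (50, 5, and 1); the classifier that produces the count of wrongly classified patterns is not shown.

```python
def fitness(mask, n_labels, n_total, n_wrong, w_pc=50.0, w_f=5.0, w_nl=1.0):
    """Fitness of one chromosome following Eqs. (29)-(32).

    mask     : binary matrix a (rows = rules, columns = features)
    n_labels : list N_1..N_M, number of linguistic variables per dimension
    n_total  : total number of patterns evaluated
    n_wrong  : number of wrongly classified patterns
    """
    pc = 100.0 * (n_total - n_wrong) / n_total   # Eq. (30), percent correct
    sc = sum(sum(row) for row in mask)           # Eq. (31), number of 1's in a
    nl = sum(n_labels)                           # Eq. (32)
    return w_pc * pc - w_f * sc - w_nl * nl      # Eq. (29)


a = [[1, 1], [1, 1], [0, 0], [0, 1]]
print(fitness(a, n_labels=[2, 2], n_total=100, n_wrong=2))
```

With these weights, one additional percentage point of accuracy outweighs ten additional antecedent conditions, which reflects the stated preference for accuracy over compactness.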

Mate selection: There are many ways of selecting individuals for mating. One of the well-known methods is roulette wheel selection [49]. The fittest individuals have a higher chance of mating than the ill-fitted ones. In roulette wheel selection, individuals are selected randomly with a probability proportional to their fitness. The reproduction probability can be defined from the fitness function as

$$\mathrm{prob}(G_i) = \frac{\mathrm{fitness}(G_i)}{\sum_{j=1}^{P} \mathrm{fitness}(G_j)}, \qquad i = 1, \ldots, P, \qquad (33)$$

where $P$ is the number of individuals in the population.
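Roulette wheel selection according to Equation (33) might be sketched as follows. The function names are hypothetical, and the sketch assumes all fitness values are positive so that the probabilities are well defined.

```python
import random


def selection_probabilities(fitness_values):
    """Eq. (33): each individual's share of the total fitness."""
    total = sum(fitness_values)
    return [f / total for f in fitness_values]


def roulette_select(population, fitness_values, rng=random):
    """Pick one individual with probability proportional to its fitness.

    Assumes every fitness value is positive (an assumption of this sketch).
    """
    probs = selection_probabilities(fitness_values)
    r = rng.random()
    cumulative = 0.0
    for individual, p in zip(population, probs):
        cumulative += p
        if r < cumulative:
            return individual
    return population[-1]   # guard against floating-point round-off


population = ["G1", "G2", "G3"]
scores = [10.0, 30.0, 60.0]
print(selection_probabilities(scores))
print(roulette_select(population, scores, random.Random(0)))
```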


                G_NL           G_R
Parent 1        2 3 | 4 5      23 0 6 | 12
Parent 2        3 4 | 5 2      42 7 8 | 10
Offspring 1     3 4 | 4 5      42 7 8 | 12
Offspring 2     2 3 | 5 2      23 0 6 | 10

(The bar "|" marks the crossover point in each group of genes.)

Figure 7: Crossover Operation

Crossover: After the mate selection operation, the crossover operation is performed. Crossover is a mechanism for exchanging information between two chromosomes, called parents, to produce two new individuals, called offspring. A crossover point is selected randomly with probability $p_c$. In our problem, since a chromosome is separated into two groups, the crossover process is also separated into two parts: the crossover of the number of linguistic variables and the crossover of the fuzzy if-then rules. The crossovers of the two parts are independent of each other. Figure 7 illustrates how two chromosomes cross over, yielding two offspring.
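The two-part crossover can be sketched as below, using the gene values shown in Figure 7. The helper names and the explicit crossover-point arguments are assumptions of this sketch; in practice the points would be drawn at random.

```python
def one_point_crossover(p1, p2, point):
    """Swap the heads of two gene lists at `point`, returning two offspring."""
    return p2[:point] + p1[point:], p1[:point] + p2[point:]


def crossover(parent1, parent2, n_dims, point_nl, point_r):
    """Cross the G_NL part and the G_R part independently.

    The first n_dims genes of a chromosome form G_NL and the rest form G_R.
    """
    nl1, nl2 = one_point_crossover(parent1[:n_dims], parent2[:n_dims], point_nl)
    r1, r2 = one_point_crossover(parent1[n_dims:], parent2[n_dims:], point_r)
    return nl1 + r1, nl2 + r2


parent1 = [2, 3, 4, 5, 23, 0, 6, 12]
parent2 = [3, 4, 5, 2, 42, 7, 8, 10]
child1, child2 = crossover(parent1, parent2, n_dims=4, point_nl=2, point_r=3)
print(child1)  # [3, 4, 4, 5, 42, 7, 8, 12]
print(child2)  # [2, 3, 5, 2, 23, 0, 6, 10]
```

Keeping the two groups separate guarantees that a gene counting linguistic variables is never exchanged with a gene encoding a rule.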

Mutation: Mutation is applied to the offspring to prevent the solution from becoming trapped in a local optimum. The mutation operation allows the genetic algorithm to explore new candidate solutions and increases the chance of approaching the global optimum. Figure 8 illustrates how the mutation operation works.

                                G_NL        G_R
Offspring (before mutation)     3 4 4 5     42 7 8 12
Offspring (after mutation)      3 4 2 5     64 7 8 12

(Mutation points: the third gene of G_NL and the first gene of G_R.)

Figure 8: Mutation Operation

An offspring chromosome mutates with the mutation probability $p_m$ on each gene. In the integer-coded genetic algorithm, the mutation process operates according to

$$G_{new} = \mathrm{round}(G_{old} + \gamma \times \mathrm{randn}(1)), \qquad (34)$$

where $G_{old}$ is the gene selected for mutation; $G_{new}$ is the gene resulting from the mutation; $\mathrm{randn}(1)$ is a random number produced by a Gaussian random generator with zero mean and unit variance; and $\gamma$ is the highest possible integer value a gene is allowed to take. The two parts of the chromosome $G$ have different values of $\gamma$. The highest possible value of the number of linguistic variables is set to 9 or smaller. The highest integer value for a gene of the if-then rules is $2^M$ for an $M$-dimensional space.
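A possible reading of the integer mutation rule, G_new = round(G_old + gamma x randn(1)), is sketched below. The clipping of mutated genes to an allowed range is an assumption added here to keep genes valid; the text does not say how out-of-range values are handled. The function names are likewise hypothetical.

```python
import random


def mutate_gene(g_old, gamma, rng=random):
    """Perturb one gene with scaled Gaussian noise and round to an integer."""
    return round(g_old + gamma * rng.gauss(0.0, 1.0))


def mutate(chromosome, gamma, p_m, low, high, rng=random):
    """Mutate each gene independently with probability p_m.

    Mutated genes are clipped to [low, high]; this clipping is an
    assumption of the sketch, not something stated in the text.
    """
    mutated = []
    for gene in chromosome:
        if rng.random() < p_m:
            gene = max(low, min(high, mutate_gene(gene, gamma, rng)))
        mutated.append(gene)
    return mutated


rng = random.Random(42)
g_nl = [3, 4, 4, 5]   # numbers of linguistic variables, at most 9
print(mutate(g_nl, gamma=9, p_m=0.8, low=1, high=9, rng=rng))
```

Because the two parts of the chromosome have different upper bounds, `mutate` would be called once for the G_NL genes and once for the G_R genes, each with its own `gamma` and range.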

D. Rule-To-Network Module

The rule-to-network module is used for transferring linguistic knowledge into the ILFN structure. The rule-to-network module allows an expert to incorporate his or her knowledge into the system. The rule-to-network module consists of the rule2ilfn algorithm, which is used for mapping the rule base to the ILFN.


1) RULE2ILFN Algorithm

There are two phases in the rule2ilfn algorithm. Phase 1 is used when the fuzzy rules are compact, i.e., when they contain don't care linguistic variables. The don't care linguistic variables need to be transformed into intermediate rules. In the transformed intermediate rules, every feature or component of a rule has a linguistic variable attached; otherwise the rules cannot be mapped to an ILFN network. Phase 2 operates after phase 1 has ended. In phase 2, the parameters of the fuzzy rules are mapped to the corresponding parameters of an ILFN network. The details of the two phases are as follows.

Phase 1: Mapping a compact rule set to an intermediate rule set.

1) Retrieve a rule $R_i = \{ A_{i1}, A_{i2}, \ldots, A_{iM}, B_i, CF_i \}$; $i = 1, \ldots, L$.

2) Check for a feature that has a don't care linguistic label (i.e., $A_{ij} = 0$). If the present rule has such a feature, expand the rule into every possible rule that covers the combinations of the available linguistic labels.

3) Repeat 1) and 2) until there are no more rules.

4) Output the intermediate rule set.
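Phase 1 can be sketched as the expansion below. The tuple layout `(antecedent, class, CF)` and the function names are assumptions of this sketch; it also assumes that every expanded rule inherits the class and certainty factor of the compact rule it came from, which the text does not state explicitly.

```python
from itertools import product


def expand_dont_care(rule, n_labels):
    """Expand one compact rule into intermediate rules without don't cares.

    rule     : (antecedent, cls, cf) with 0 meaning "don't care"
    n_labels : number of linguistic labels available in each dimension
    """
    antecedent, cls, cf = rule
    choices = [
        range(1, n_labels[j] + 1) if label == 0 else [label]
        for j, label in enumerate(antecedent)
    ]
    return [(list(combo), cls, cf) for combo in product(*choices)]


def compact_to_intermediate(rules, n_labels):
    """Apply the expansion to every rule and collect the intermediate set."""
    intermediate = []
    for rule in rules:
        intermediate.extend(expand_dont_care(rule, n_labels))
    return intermediate


compact = [([1, 2], 1, 1.0), ([0, 1], 3, 1.0)]
for r in compact_to_intermediate(compact, n_labels=[2, 2]):
    print(r)
```

A rule with k don't care features expands into the product of the label counts of those k features, so the intermediate rule set can be considerably larger than the compact one.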

Phase 2: Mapping an intermediate rule set to an initial ILFN network.

■ Set $W_P$, $W_T$, STD, and count to be empty sets.

■ For the $i$th rule, $i = 1$ to $L$, do

   • For the $j$th feature, $j = 1$ to $M$, do

      o Set $w_{P,ij} = \mu_{ij}$
      o Set $std_{ij} = \sigma_{ij}$

   • Set $w_{T,i} = B_i$

   • Set $cnt_i = 1$

■ $W_P = [w_{P1}, w_{P2}, \ldots, w_{PL}]^T$, where $w_{Pi} = [w_{P,i1}, w_{P,i2}, \ldots, w_{P,iM}]$

■ $W_T = [w_{T1}, w_{T2}, \ldots, w_{TL}]^T$

■ $STD = [\sigma_1, \sigma_2, \ldots, \sigma_L]^T$, where $\sigma_i = [\sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{iM}]$

■ $count = [cnt_1, cnt_2, \ldots, cnt_L]^T$

$L$ is the number of rules and $M$ is the dimension of the pattern space. The parameters $\mu_{ij}$ and $\sigma_{ij}$, $i = 1, \ldots, L$, $j = 1, \ldots, M$, are the mean and the standard deviation of the linguistic label indexed by $A_{ij}$. More specifically, $\mu_{ij}$ and $\sigma_{ij}$ are taken from the membership function parameters (MP) of Equation (7).

After the initial ILFN network is obtained, the available training data are used to refine it. Network pruning is also needed to eliminate hidden nodes to which no pattern belongs. This can be done by checking the parameter count: since every node starts with a count of one, a node whose count is still equal to one after refinement has attracted no training pattern and is eliminated.
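Phase 2 and the subsequent pruning step might look like the sketch below. The layout of the membership parameters, `mp[j][label] = (mean, std)`, is an assumption made for illustration, as are the function names; the refinement of the network with training data is not shown.

```python
def rules_to_ilfn(rules, mp):
    """Map intermediate rules (no don't cares) to initial ILFN parameters.

    rules : list of (antecedent, cls, cf) tuples
    mp    : mp[j][label] = (mean, std) of that label in dimension j
            (assumed layout of the membership function parameters)
    """
    W_P, W_T, STD, count = [], [], [], []
    for antecedent, cls, _cf in rules:
        W_P.append([mp[j][label][0] for j, label in enumerate(antecedent)])
        STD.append([mp[j][label][1] for j, label in enumerate(antecedent)])
        W_T.append(cls)      # target class of the hidden node
        count.append(1)      # every node starts with a count of one
    return W_P, W_T, STD, count


def prune(W_P, W_T, STD, count):
    """Drop hidden nodes whose count is still one after refinement."""
    keep = [i for i, c in enumerate(count) if c != 1]
    return ([W_P[i] for i in keep], [W_T[i] for i in keep],
            [STD[i] for i in keep], [count[i] for i in keep])


# Toy membership parameters for a two-dimensional problem (made-up numbers).
mp = [{1: (2.0, 1.0), 2: (7.0, 1.0)}, {1: (1.5, 0.5), 2: (6.0, 0.5)}]
rules = [([1, 2], 1, 1.0), ([2, 2], 2, 1.0)]
W_P, W_T, STD, count = rules_to_ilfn(rules, mp)
print(W_P, W_T, STD, count)
```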

E.  The  Decision-Explanation  Module 

The last module in the HIS is the decision-explanation module. The decision-explanation module performs two functions: making a decision and explaining the decision. For the first function, making a decision, the module receives two inputs: the outputs of the low-level ILFN and of the higher-level FES. The second function of the module is to generate a natural language explanation of the decision made, using the knowledge base (K) from the FES module.


Figure 9: Hybrid System Combined from ILFN and FES

Figure 9 shows an equivalent structure of the HIS. The decision for the class output, $C_J$, can be calculated by the following equations:

$$C = [C_1, C_2, \ldots, C_Q] \qquad (35)$$

$$\Phi_\alpha = [\phi_{\alpha 1}, \phi_{\alpha 2}, \ldots, \phi_{\alpha Q}] \qquad (36)$$

$$\Phi_\beta = [\phi_{\beta 1} \times CF_1, \phi_{\beta 2} \times CF_2, \ldots, \phi_{\beta Q} \times CF_Q] \qquad (37)$$

$$y_i = \frac{\alpha_i \Phi_{\alpha i} + \beta_i \Phi_{\beta i}}{\alpha_i + \beta_i}, \quad \text{for } i = 1, \ldots, Q \qquad (38)$$

$$y = [y_1, y_2, \ldots, y_Q] \qquad (39)$$

$$C_J = C_j \mid j = \arg\max_i (y_i) \qquad (40)$$

where $C$ is the class vector; $\Phi_\alpha$ holds the membership values from the ILFN with respect to $C$; $\Phi_\beta$ holds the membership values from the FES with respect to $C$; $\Phi_{\alpha i}$ and $\Phi_{\beta i}$ denote the $i$th elements of $\Phi_\alpha$ and $\Phi_\beta$; $y$ holds the membership values from the HIS with respect to $C$; $\alpha = [\alpha_1, \alpha_2, \ldots, \alpha_Q]$ and $\beta = [\beta_1, \beta_2, \ldots, \beta_Q]$ are the real-valued weights linking the ILFN and the FES, respectively, to the decision-explanation module; $Q$ is the number of classes; and $C_J$ is the class decision output from the HIS. Note that $\alpha$ and $\beta$ can be specified by the user or determined by an optimization algorithm such as the GA. In our study we used a real-coded GA to search for near-optimal values of $\alpha$ and $\beta$.

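For simplicity, we may set $\alpha = \alpha_1 = \alpha_2 = \ldots = \alpha_Q$ and $\beta = \beta_1 = \beta_2 = \ldots = \beta_Q$. Then we have

$$y = \frac{\alpha \Phi_\alpha + \beta \Phi_\beta}{\alpha + \beta} \qquad (41)$$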
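The hybrid decision can be sketched as a weighted average followed by an argmax. This sketch assumes that each class score has the form (alpha_i * phi_alpha_i + beta_i * phi_beta_i * CF_i) / (alpha_i + beta_i), which is the per-class counterpart of Equation (41); the function name and the example numbers are hypothetical.

```python
def hybrid_decision(classes, phi_ilfn, phi_fes, cf, alpha, beta):
    """Combine ILFN and FES memberships and return (decided class, scores).

    phi_ilfn : membership values from the ILFN, one per class
    phi_fes  : membership values from the FES, one per class
    cf       : certainty factor of the FES output for each class
    alpha, beta : per-class weights of the ILFN and the FES
    """
    y = [
        (a * pa + b * pb * c) / (a + b)      # weighted average per class
        for pa, pb, c, a, b in zip(phi_ilfn, phi_fes, cf, alpha, beta)
    ]
    j = max(range(len(y)), key=y.__getitem__)   # argmax over classes
    return classes[j], y


decided, scores = hybrid_decision(
    classes=[2, 4],
    phi_ilfn=[0.6, 0.4],
    phi_fes=[0.2, 0.9],
    cf=[1.0, 0.5],
    alpha=[1.0, 1.0],
    beta=[1.0, 1.0],
)
print(decided, scores)
```

In this made-up example the ILFN alone favors the first class while the FES favors the second; with equal weights the FES prevails, and raising alpha relative to beta would shift the decision back toward the ILFN.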


[Legend: ILFN Boundaries; FES Grid Partitions; Hybrid Boundaries]

Figure 10: Hybrid Decision Boundaries

1) Decision Boundaries

The decision boundaries of the HIS come from the weighted average of the boundaries of the ILFN and the FES. The decision boundaries of the ILFN and those of the FES contribute in different ways. The decision boundaries of the ILFN emphasize local areas to achieve better generalization, while the decision boundaries of the FES preserve human interpretability. Since the numerical information in the ILFN and the linguistic information in the FES have complementary benefits, it is preferable to incorporate both structures into the same system. The boundaries of the hybrid system provide both accuracy and interpretability. In the HIS, the ILFN serves as a low-level numerical computation, while the FES operates as a higher-level linguistic computation. The hybrid weights ($\alpha$ and $\beta$) play an important role in adjusting the hybrid decision boundaries. If $\alpha_i$ is larger than $\beta_i$, the hybrid boundaries tend toward the ILFN boundaries. If $\alpha_i$ is smaller than $\beta_i$, the hybrid boundaries tend toward the FES boundaries. Figure 10 shows the hybrid decision boundaries of the combined ILFN and FES.

2)  Conflict  and  Conflict  Resolution  Between  Low  Level  and  Higher  Level 

Ideally, there should be no conflict between the low-level and higher-level decisions in the HIS if there is a one-to-one mapping between them. Since a combination of grid-based partitioning and projection is used in the mapping process, decisions from the ILFN and the FES may conflict with each other. The diagram in Figure 11 shows the possible conflicting decisions between the ILFN and the FES. The system is in conflict if the decision from the ILFN is correct but the decision from the FES is wrong, or if the decision from the ILFN is wrong but the decision from the FES is correct. The system is not in conflict if the two systems make the same decision. It is preferable that the two systems are not in conflict and both make correct decisions.

It is feasible to resolve the conflict between the two systems by forcing their decision boundaries to be as close to the HIS decision boundaries as possible. Conflict resolution for the HIS then amounts to determining the elements of $\alpha$ and $\beta$. In fact, this becomes an ordinary optimization problem, which can be solved by any optimization method. Because it requires no derivative calculation and is unlikely to be trapped at local minima, the GA can be adopted for this purpose. The GA searches for weights $\alpha$ and $\beta$ that bring the decision boundaries of the two modules closer together. The hybrid decision is ultimately forced onto the diagonal path indicated by the dashed line in Figure 11, which ideally shows that the ILFN and the FES are not in conflict and both make correct decisions.


[Axis label: Decision from ILFN]

Figure 11: Conflict Decision between the ILFN and the FES


F. Incremental Learning Characteristic of the Proposed HIS

An incremental learning system updates its knowledge without retraining on old data. Only new data is needed in the learning process. This concept has been studied by many researchers (see [35]-[38]). In a hybrid structure that combines a low level and a higher level, such as numerical and symbolic or numerical and linguistic systems, it is preferable to incorporate the incremental feature into the system. Incremental learning is important in applications such as control and process monitoring, as well as in medical diagnosis. New knowledge needs to be captured in real time without spending a great deal of time relearning all the old data along with the new data.

Since the proposed HIS incorporates the ILFN, which is an incremental learning architecture, it is easy to exploit its incremental learning capability. The ILFN can learn all patterns within only one pass. While operating, the ILFN detects new, unseen class prototypes. If new knowledge is found, it is added to the hidden units without destroying the old knowledge. Extending the incremental learning feature to the higher-level linguistic model is straightforward. New linguistic rules can be extracted directly from the new hidden nodes of the ILFN by using the ilfn2rule algorithm in the network-to-rule module. An algorithm for checking conflicts is run to maintain consistency between the two levels. Similarly, if the higher level acquires new knowledge, i.e., linguistic rules that may come from an expert or experienced users, the new knowledge needs to be mapped to the ILFN structure as well. This can be done by using the rule2ilfn algorithm in the rule-to-network module.

III. SIMULATION RESULTS

To demonstrate the performance of the HIS, computer simulations were carried out. The simulations and analysis of the HIS use the well-known benchmark Wisconsin breast cancer database (WBCD) [50] as an example of an application in medical diagnosis. The WBCD is a real application that has been used by many researchers [12], [23], [26], [29].

The WBCD contains a collection of 699 patterns, each described by 9 features. Each feature takes a value in the interval 1 to 10 and is based on a fine needle aspirate taken directly from human breasts: clump thickness, size uniformity, shape uniformity, marginal adhesion, cell size, bare nuclei, bland chromatin, normal nucleoli, and mitosis. The larger the values of these attributes, the greater the likelihood of malignancy. There are 458 benign patterns (labeled "2" in the database) and 241 malignant patterns (labeled "4"). There are 16 patterns with incomplete feature descriptions marked as "?" [50], [51]. We replaced the missing values with "0."

A.  Simulation  Results  for  the  WBCD 

Ten simulations were performed to evaluate the proposed method. In every simulation, the ILFN learning parameters were set to the following defaults: the threshold $\varepsilon = 0$ and the standard deviation $\sigma_0 = 0.5$. The numerical weights of the ILFN network were extracted into initial fuzzy linguistic rules. To optimize the linguistic rules, the GA with the integer chromosome representation was used, with its learning parameters set heuristically as follows: population size = 100, number of generations = 100, mutation probability $p_m = 0.8$, and crossover probability $p_c = 0.01$. The weights in the fitness evaluation were set as follows: $W_{PC} = 50$, $W_F = 5$, and $W_{NL} = 1$. The number of linguistic labels was constrained to at most 3 for each dimension. The GA with a real chromosome representation was also used to find the weighting parameters $\alpha$ and $\beta$. The parameters for the real GA were as follows: population size = 60, number of generations = 20, mutation probability $p_m = 0.8$, and crossover probability $p_c = 0.01$. The results from the ten simulations are shown in Table 1.


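TABLE 1

The Simulation Results for the WBCD

Run no. | Method | # Nodes and/or # Rules | # Conditions* | # Training patterns | # Test patterns | % Correct, training set | % Correct, test set | % Correct, overall patterns
--------|--------|------------------------|---------------|---------------------|-----------------|-------------------------|---------------------|----------------------------
1 | Numerical (ILFN) | 3 nodes | 9 F/N | 100 | 599 | 98% | 96.83% | 97.00%
1 | Fuzzy Rules | [missing] | [missing] | 100 | 599 | 98% | 96.49% | 96.71%
1 | Hybrid ILFN and Fuzzy Rules | [missing] | 9 F/N & 2.3 F/R | 100 | [missing] | [missing] | [missing] | 97.00%
2 | Numerical (ILFN) | 3 nodes | 9 F/N | 100 | 599 | 98% | [illegible] | [missing]
2 | Fuzzy Rules | 3 rules | 4 F/R | 100 | 599 | 98% | [illegible] | [missing]
2 | Hybrid ILFN and Fuzzy Rules | 3 nodes & 3 rules | 9 F/N & 4 F/R | 100 | 599 | 98% | 97.14% | 97.57%
3 | Numerical (ILFN) | 3 nodes | 9 F/N | [missing] | 342 | 98% | [illegible] | 97.23%
3 | Fuzzy Rules | 3 rules | 2.7 F/R | 341 | 342 | 97.95% | [illegible] | 96.71%
3 | Hybrid ILFN and Fuzzy Rules | 3 nodes & 3 rules | 9 F/N & 2.7 F/R | 341 | 342 | 98% | 98.25% | 98.13%
4 | Numerical (ILFN) | 4 nodes | 9 F/N | 120 | 358 | 95.83% | 97.49% | 96.57%
4 | Fuzzy Rules | 4 rules | 3.75 F/R | 341 | 358 | 97.36% | 96.65% | 97.00%
4 | Hybrid ILFN and Fuzzy Rules | 4 nodes & 4 rules | 9 F/N & 3.75 F/R | 341 | 358 | 97.07% | 97.50% | 97.43%
5 | Numerical (ILFN) | 4 nodes | 9 F/N | 120 | 342 | 95.83% | [missing] | 96.93%
5 | Fuzzy Rules | [illegible] | 2.67 F/R | 341 | 342 | 96.77% | [missing] | 96.79%
5 | Hybrid ILFN and Fuzzy Rules | 4 nodes & 3 rules | 9 F/N & 2.67 F/R | 341 | 342 | 96.48% | 98.25% | 97.36%
6 | Numerical (ILFN) | [illegible] | 9 F/N | 150 | 342 | 94.67% | [missing] | 96.63%
6 | Fuzzy Rules | [illegible] | 2.2 F/R | 341 | 342 | 97.07% | [illegible] | 97.22%
6 | Hybrid ILFN and Fuzzy Rules | 5 nodes & 5 rules | 9 F/N & 2.2 F/R | 341 | 342 | 97.07% | 97.08% | 97.07%
7 | Numerical (ILFN) | 5 nodes | 9 F/N | 150 | 342 | 94.67% | [illegible] | 96.63%
7 | Fuzzy Rules | 4 rules | 2.5 F/R | 683 | 683 | 97.07% | 97.07% | 97.07%
7 | Hybrid ILFN and Fuzzy Rules | 5 nodes & 4 rules | 9 F/N & 2.5 F/R | 683 | 683 | 97.22% | 97.22% | 97.22%
8 | Numerical (ILFN) | 3 nodes | 9 F/N | 100 | 342 | 98% | [illegible] | 97.23%
8 | Fuzzy Rules | 2 rules | 3 F/R | 683 | 683 | 97.23% | 97.23% | [missing]
8 | Hybrid ILFN and Fuzzy Rules | 3 nodes & 2 rules | 9 F/N & 3 F/R | 683 | 683 | 97.57% | 97.57% | 97.57%
9 | Numerical (ILFN) | [illegible] | 9 F/N | 150 | 549 | 94.67% | 97.19% | 96.42%
9 | Fuzzy Rules | [illegible] | 2.5 F/R | 699 | 699 | 97.57% | 97.57% | 97.57%
9 | Hybrid ILFN and Fuzzy Rules | 5 nodes & 4 rules | 9 F/N & 2.5 F/R | 699 | 699 | 97.57% | 97.57% | 97.57%
10 | Numerical (ILFN) | 3 nodes | 9 F/N | 100 | 599 | 98% | 96.83% | 97.00%
10 | Fuzzy Rules | 3 rules | [illegible] | 699 | 699 | 96.85% | 96.85% | 96.85%
10 | Hybrid ILFN and Fuzzy Rules | 3 nodes & 3 rules | [missing] | 699 | 699 | 97.42% | 97.42% | 97.42%
Average | Numerical (ILFN) | 3.8 nodes | 9 F/N | - | - | 96.17% | [missing] | 96.77%
Average | Fuzzy Rules | 3.4 rules | 2.77 F/R | - | - | [missing] | [missing] | 97.08%
Average | Hybrid ILFN and Fuzzy Rules | 3.8 nodes & 3.4 rules | 9 F/N & 2.77 F/R | - | - | 97.61% | 97.48% | 97.55%

* F/N = # features per node and F/R = # features per rule.
[illegible] marks a cell that is garbled in the source and [missing] a cell that is absent from it; in rows that lost a cell, the column placement of the surviving percentages is a best reading.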


From Table 1, the ILFN achieved an average correct classification of 96.17% on the training set and 97.37% on the test set. The fuzzy rules extracted from the trained ILFN achieved an average correct classification of 97.43% on the training set and 96.72% on the test set. It is worth noting that the fuzzy rules extracted from the trained ILFN achieved a higher classification rate on the training set but a lower percentage of correctly classified patterns on the test set. When we combined the ILFN and the extracted fuzzy rules to construct a HIS, the HIS achieved a higher classification rate than either the ILFN or the extracted fuzzy rules alone. The proposed HIS had an average of 97.61% and 97.48% correct classification on the training set and the test set, respectively.

Due to space limitations, we show the details of the numerical weights of the ILFN and the extracted linguistic rules for one example only, chosen for the best classification performance of the HIS, i.e., run number 3 in Table 1. The details of the ILFN and its extracted linguistic rules for run number 3 are shown in Tables 2, 3, and 4.

In run number 3, we used 100 patterns for training the ILFN, 341 patterns for training the FES and the HIS, and 342 patterns for testing all three systems. The ILFN constructed 3 hidden nodes with the parameters shown in Table 2. The ILFN network achieved 98% and 97.95% correct classification for the training set and the test set, respectively.

As Table 2 shows, the knowledge embedded in the trained ILFN is in numerical form. For reasoning purposes, it is preferable to extract linguistic rules from the trained ILFN. The fuzzy linguistic rules are mapped from the ILFN parameters, and the GA is used to select only the discriminatory features. This results in a more compact rule set.

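TABLE 2:

ILFN Parameters for the WBCD

W_P (one row per hidden node, one column per feature) and W_T:

Node | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | W_T
-----|----|----|----|----|----|----|----|----|----|----
1 | 2.7818 | 1.3455 | 1.4182 | 1.2727 | 2.0545 | 1.5273 | 2.7818 | 1.1818 | 1.0909 | 2
2 | 7.3462 | [illegible] | 6.6154 | 5.1923 | [missing] | 7.5 | 5.5385 | 6.9615 | 3.4615 | 4
3 | 6.6316 | 3.7895 | 4.3684 | 2.6842 | 3.8947 | 4 | [illegible] | 4.4211 | 2.0526 | 4

Standard deviation and count:

Node 1: 8.0293, 4.1716, 2.6234, 2.3358, 5.2169, 7.9414, 2.2247, 1.6642 (only eight of the nine values survive in the source; the position of the missing value is unknown); count = 55
Node 2: 8.9208, 9.4657, 7.9145, 12.538, 9.2651, 9.8717, 7.2446, 10.184, 13.132; count = 26
Node 3: 8.9898, 4.0137, 4.2122, 4.8625, 5.2584, 7.6044, 3.4663, 8.9787, 7.9811; count = [missing]

TABLE 3: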

Resulting Linguistic Labels and Their Parameters for the WBCD

Feature | Linguistic labels and parameters*
--------|----------------------------------
F1 = Clump Thickness | 1: low_1 [parameters missing]; 2: high_1 (Gaussian: 7.3462, 1.504)
F2 = Size Uniformity | 1: low_2 (Gaussian: 1.3455, 1.7491); 2: high_2 [illegible]
F3 = Shape Uniformity | 1: low_3 (Gaussian: 1.4182, 0.85625); 2: [label missing] (Gaussian: 4.0168, 0.85625); [remaining entries missing]
F4 = Marginal Adhesion | 1: low_4 (Gaussian: 1.2727, 1.2915); 2: high_4 (Gaussian: 5.1923, 1.2915)
F5 = Cell Size | 1: low_5 (Gaussian: 2.0545, 0.77676); 2: medium_5 (Gaussian: 4.4119, 0.77676); 3: high_5 (Gaussian: 6.7692, 0.77676)
F6 = Bare Nuclei | 1: low_6 (Gaussian: 1.5273, 1.968); 2: high_6 (Gaussian: 7.5, 1.968)
F7 = Bland Chromatin | 1: low_7 (Gaussian: 2.7818, 0.45416); 2: medium_7 (Gaussian: 4.1601, 0.45416); 3: high_7 (Gaussian: 5.5385, 0.45416)
F8 = Normal Nucleoli | 1: low_8 (Gaussian: 1.1818, 0.95222); 2: medium_8 (Gaussian: 4.0717, 0.95222); 3: high_8 (Gaussian: 6.9615, 0.95222)
F9 = Mitosis | 1: low_9 (Gaussian: 1.0909, 0.78113); [remaining entries missing]

* Since Gaussian membership functions are used, the parameters of the linguistic labels are written as (Gaussian: mean, standard deviation).

TABLE 4:

Fuzzy Expert Rules for Wisconsin Breast Cancer Data

[For each of Rules 1-3, the table plots the membership functions of the antecedent features used by the rule, followed by the consequent columns Class and CF. The plots are not recoverable from the source; the legible consequent entries are Class 4 with CF = 0.57778 and Class 4 with CF = 0.42222.]

After  running  for  100  generations,  the  resulting  fuzzy  linguistic  rules  are  shown  in  Table 
3.  The  fuzzy  linguistic  rules  are  shown  in  Table  4.  Using  the  linguistic  knowledge  from  Tables  3 
and  4  as  the  rule  set  for  a  fuzzy  expert  system,  the  final  fuzzy  linguistic  rules  achieved  97.95% 
and  96.49%  correct  classification  for  training  and  testing  data,  respectively.  The  hybrid 
intelligent  system  combining  the  decisions  from  both  ILFN  and  FES  achieved  98%  correct 
classification  rate  for  training  set  and  98.25%  correct  classification  rate  for  the  testing  set.  The 
HIS  achieved  98.13%  in  all  683  patterns  of  the  WBCD.  Fuzzy  expert  rules  in  natural  language 
for  the  WBCD  can  be  interpreted  as 

Rule 1: If Clump Thickness is low1 and Size Uniformity is low2 and Marginal Adhesion is
low4 and Cell Size is low5 and Normal Nucleoli is low8 and Mitosis is low9, Then
Malignant, with confidence = 1;

Rule 2: If Clump Thickness is high1 and Shape Uniformity is high3 and Mitosis is high9,
Then Benign, with confidence = 0.58;

Rule 3: If Clump Thickness is high1 and Bland Chromatin is medium7 and Mitosis is low9,
Then Benign, with confidence = 0.42;
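
As a rough illustration of how a rule of this form could be evaluated, the sketch below computes the firing strength of a rule shaped like Rule 3. It assumes the usual Gaussian form exp(-(x - mean)^2 / (2 std^2)) for each label and the min operator for the "and" connective; neither choice is confirmed by the text. Only the Mitosis label uses parameters from the label listing (low9: 1.0909, 0.78113); the other two labels and all function names are hypothetical.

```python
import math

def gaussian_membership(x, mean, std):
    """Membership degree of x in a label written as (Gaussian: mean, std)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

def firing_strength(sample, antecedents):
    """AND the antecedent clauses with the min operator (an assumed t-norm)."""
    return min(gaussian_membership(sample[name], mean, std)
               for name, (mean, std) in antecedents.items())

# Antecedents of a rule shaped like Rule 3. Only the Mitosis label uses the
# low9 parameters from the label listing; the other two are hypothetical.
RULE3 = {
    "clump_thickness": (7.0, 1.0),   # hypothetical "high1"
    "bland_chromatin": (4.0, 1.0),   # hypothetical "medium7"
    "mitosis": (1.0909, 0.78113),    # low9 from the label listing
}
RULE3_CF = 0.42222  # certainty factor of Rule 3

sample = {"clump_thickness": 7.0, "bland_chromatin": 4.0, "mitosis": 1.0}
strength = firing_strength(sample, RULE3)
print(f"firing strength = {strength:.4f}, weighted by CF = {strength * RULE3_CF:.4f}")
```

Weighting the firing strength by the certainty factor is one common way to rank competing rules; how the FES actually combines rule outputs is not stated here.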

B.  Comparison  Results  Among  Other  Methods 

Several groups of researchers have studied and developed knowledge-based systems for
the WBCD. Pena-Reyes and Sipper used a fuzzy if-then system as a classifier. They developed a
fuzzy-GA algorithm to extract rules from the WBCD. The fuzzy-GA algorithm uses the genetic
algorithm (GA) to search for two parameters, P and d, of their fuzzy rules [12]. The number of
rules has to be predetermined in an ad hoc manner. In [23], Setiono developed a rule extraction
method called NeuroRule. NeuroRule uses a pruning procedure after the training phase to decrease
the number of network connections. The pruning process runs until network performance drops
to a 95% correct classification rate. In [23], 100 MLP networks were used in the training phase.
The network with the highest performance out of the 100 pruned networks was used in the
rule-extraction phase. NeuroRule extracts rules by clustering the hidden-node activation values.
Then, the input combinations are checked to see whether any input makes the hidden nodes and the
output node active. An improvement of NeuroRule on the WBCD was studied by the same author in
[29] by pre-processing the data before the training step. Another group was Taha and Ghosh [26].
In [26], three rule extraction algorithms were developed: BIO-RE, Partial-RE, and Full-RE. BIO-RE
is a black-box rule extraction technique which does not require information regarding the internal
network structure to generate rules. Partial-RE searches for a set of incoming connections that
will cause a unit to be active. Full-RE decomposes the rule extraction process into two steps:
rules between hidden and output units and rules between input units and hidden units. This is
similar to NeuroRule [29], but the difference is that Full-RE employs linear programming and an
input discretization method to find a combination of the input values that will cause a hidden
unit to be active. Table 5 compares the results on the WBCD of several rule-based systems from
[12], [23], [26], [29].

TABLE 5:

Comparison Results for the WBCD Among Well-known Methods

Column groups: Rule complexity (# Rules, # F/R*); Number of patterns (# Training, # Test); Performance Evaluation (% Training corr., % Test corr., % Overall corr.)

Methods | Representation Type | # Rules | # F/R* | # Training | # Test | % Training corr. | % Test corr. | % Overall corr.
Setiono [23] | Boolean Rules | 1 + default | 2 | 350 | 349 | 96.86% | 93.98% | 95.42%
Setiono [23] | Boolean Rules | 2 + default | 4 | 350 | 349 | 97.71% | 96.56% | 97.14%
Setiono [29] | Boolean Rules | 1 + default | 4 | 341 | 342 | 97.07% | 97.66% | 97.36%
Setiono [29] | Boolean Rules | 3 + default | 3.7 | 341 | 342 | 97.95% | 98.25% | 98.10%
Setiono [29] | Boolean Rules | 4 + default | 1 | 341 | 342 | 97.07% | 97.66% | 97.36%
Setiono [29] | Boolean Rules | 5 + default | 4.2 | 341 | 342 | 98.53% | 97.95% | 98.24%
Setiono [29] | Boolean Rules | 6 + default | 1.7 | 341 | 342 | 97.95% | 98.25% | 98.10%
Pena and Sipper [12] | Fuzzy Rules | 1 + default | 4 | 341 | 342 | | | 97.07%
Pena and Sipper [12] | Fuzzy Rules | 2 + default | 3 | 341 | 342 | | | 97.36%
Pena and Sipper [12] | Fuzzy Rules | 3 + default | 4.7 | 341 | 342 | | | 97.80%
Pena and Sipper [12] | Fuzzy Rules | 4 + default | 4.8 | 341 | 342 | | | 97.80%
Pena and Sipper [12] | Fuzzy Rules | 5 + default | 3.4 | 341 | 342 | | | 97.51%
Taha and Ghosh [26] | Boolean Rules | 11 + default | 2.7 | 341 | 342 | 97.07% | 96.20% | 96.63%
Taha and Ghosh [26] | Boolean Rules | 9 + default | 2.67 | 341 | 342 | 97.07% | 95.91% | 96.49%
Taha and Ghosh [26] | Boolean Rules | 5 | 1.8 | 341 | 342 | 96.77% | 95.61% | 96.19%
This study | Numerical (ILFN) | (3 nodes) | (9 F/N) | 100 | 342 | 98% | 97.95% | 97.23%
This study | Fuzzy Rules | 3 | 2.7 | 341 | 342 | 97.95% | 96.49% | 96.71%
This study | Hybrid ILFN and Fuzzy Rules | (3 nodes & 3 rules) | (9 features & 2.7 F/R) | 341 | 342 | 98% | 98.25% | 98.13%

* F/R = # features per rule and F/N = # features per node.
From  Table  5,  the  best  performance  was  from  NeuroRule  [29]  with  5  rules  plus  a  default 
rule  extracted  from  one  of  the  100  pruned  networks  with  2  hidden  units  and  9  connections.  The 
accuracy  rate  was  98.24%  in  683  patterns.  The  rule  set  extracted  in  [29]  is  as  follows: 

If F2 < 4 and F6 < 2 and Fs < 2, then benign,

Else if F2 < 4 and Fg < 2 and Fs < 8 and F1 < 6, then benign,

Else if F1 < 5 and F4 < 4 and F$ < 5 and Fs < 2, then benign,

Else if F1 < 6 and F2 < 4 and F6 < 6 and Fs < 8, then benign,

Else if F2 < 4 and F4 < 5 and F6 < 5 and 3 < F6 < 5 and F8 < 8, then benign,

Else malignant.

NeuroRule does not produce any rule for malignancy; it needs a default rule for
malignancy. Fuzzy-GA [12] extracted rules based on a predetermined number of rules in the
range of 1 to 5. A total of 120 evolutionary runs were performed. The highest-performance
system achieved a 97.80% correct classification rate using 3 fuzzy if-then rules with 4.7
conditions per rule, and a default rule. In [26], using the BIO-RE algorithm, the best-performing
system achieved 96.96% using 11 Boolean rules with 2.7 conditions per rule. Using the Full-RE
algorithm, the best performance was 96.19% with 5 rules and 1.8 conditions per rule (no default
rule). NeuroRule [23], [29], fuzzy-GA [12], BIO-RE [26], and Partial-RE [26] have a default rule,
which means they lack completeness. Default rules do not provide a symbolic interpretation of the
decision other than “because none of the above occurred” [26].

Based on ten runs, our proposed HIS achieved 98.13% correct classification for all 683
patterns. The HIS used ILFN with 3 hidden nodes and FES with 3 fuzzy if-then rules and 2.7
conditions per rule. An advantage of the proposed HIS is that it incorporates an incremental
learning characteristic in the system. Since data can be made available on a daily basis, with the
proposed HIS the novel data can be added into the system quickly without spending too much
time on retraining all the old information.
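
The overall rates quoted for the 683 patterns appear consistent with a pattern-weighted average of the training and testing rates. The short check below assumes that relationship (it is not stated explicitly in the text) and applies it to the HIS figures and to the 5-rule NeuroRule row of Table 5; the function name is illustrative.

```python
def overall_accuracy(n_train, acc_train, n_test, acc_test):
    """Pattern-weighted average of training and testing accuracy (in percent)."""
    return (n_train * acc_train + n_test * acc_test) / (n_train + n_test)

# HIS: 341 training patterns at 98%, 342 testing patterns at 98.25%.
his_overall = overall_accuracy(341, 98.0, 342, 98.25)

# NeuroRule [29] with 5 rules + default: 98.53% training, 97.95% testing.
neurorule_overall = overall_accuracy(341, 98.53, 342, 97.95)

print(round(his_overall, 2), round(neurorule_overall, 2))
```

Under this assumption the two results round to 98.13 and 98.24, matching the overall figures reported for those two systems.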

IV.  CONCLUSION 

Combination  of  a  trained  ILFN  network  and  a  fuzzy  expert  system  (FES)  into  a  unified 
structure  results  in  a  “Hybrid  Intelligent  System”  (HIS)  useful  for  decision-making  frameworks. 
The  proposed  HIS  offers  mutually  complementary  advantages  from  an  ILFN  network  and  a  FES. 
This  system  can  be  useful  for  complex  real-world  applications,  in  particular,  medical  diagnosis 
where  different  processing  strategies  have  to  be  supported. 

A  mapping  mechanism  from  high-level  linguistic  knowledge  to  a  low  level  ILFN 
network  is  provided  to  continuously  maintain  consistency  between  low-level  and  higher-level 
modules.  This  allows  an  expert  to  add  or  revise  linguistic  rules  to  the  system.  New  knowledge  is 
then  mapped  back  to  the  ILFN  structure  allowing  the  ILFN  network  to  update  its  parameters. 

Fuzzy rules in the FES are generated directly from the ILFN network using the
“ilfn2rule” algorithm. After using the ilfn2rule algorithm to map the ILFN numerical variables
to linguistic variables, the genetic algorithm is used to improve the rule set. Based on the initial
rules extracted, the number of rules and features of the FES are optimized by using the genetic
algorithm. A compact FES with only essential discriminatory features is obtained.

The trained ILFN and the optimized FES are combined into a hybrid system. It is found
that the combination of the optimized rule base with the trained ILFN achieves better
classification results on both training and testing patterns. It is also found that some rules in the
original rule set extracted from the trained ILFN may conflict with each other. However, after
using the genetic algorithm to refine rules and features, the rule conflicts can be resolved.

The resulting knowledge from the proposed rule extraction procedure is represented in
“if-then” linguistic form that is easily comprehensible to human users. By integrating the FES
and ILFN, explanations and answers can be easily generated when needed while numerical
accuracy is maintained.

Computer simulations using the well-known Wisconsin breast cancer database were
performed. The low-level ILFN has only a few hidden nodes and the higher-level linguistic model
extracted has a very small number of rules. The HIS that combines the trained ILFN and the fuzzy
linguistic rules achieved very good classification performance compared with the original system
as well as other rule-based methods.
ABSTRACT

The growing demand in system reliability and survivability under failures has urged ever
[illegible] development of [illegible]. In this research work, the on-line fault tolerant control
problem for dynamic systems under unanticipated failures is investigated from a realistic point of
view without any specific assumption on the type of system dynamical structure or failure
scenarios. The necessary and sufficient conditions for system on-line stability under catastrophic
failures are derived using the discrete-time Lyapunov stability theory. Based upon the existing
control theory and modern intelligent techniques, an on-line fault accommodation control
[illegible] deal with the desired trajectory-tracking problems for systems suffering from various
unanticipated catastrophic component failures. Theoretical analysis indicates that the control
problem of interest can be solved on-line without a complete realization of the [illegible]
constructed to [illegible] experiment show encouraging results and a promising future for on-line
real-time fault tolerant control based solely upon insufficient information of the system dynamics
[illegible].

1. Introduction

The quest for system reliability under failures has drawn significant attention in
[illegible] of the system, the system stability becomes a critical concern [illegible] stability
under failure incidents seriously impacts system survivability. Urged by the growing demands in
system safety and reliability, extensive research activities [illegible] system on-line stability
under failures. Most of the existing fault accommodation [illegible] upon the powerful and
well-developed linear design methodology or based upon the assumption of a certain type of
nonlinear systems under [illegible] conditions to obtain the desired objectives. The representative
approaches included the [illegible] model-following method [illegible], eigenstructure [illegible],
[illegible] actuator failures [4], reconfiguration control with [illegible] identification
[illegible], state-space pole placement [6], and the learning approach [illegible]. However,
[illegible] is rarely the case in practice since all the systems are inherently [illegible] dynamics
under failure situations are much more likely to be nonlinear and time varying. [illegible] control
law reconfigurations are relatively easy to realize and solve under the [illegible] structures, its
usefulness in practical situations is also limited. [illegible] a dynamic system under totally
unanticipated catastrophic failures, it is not a realistic approach to assume certain types of
dynamic changes caused by the unexpected failures. [illegible] realistic on-line fault tolerant
control point of view, the interesting question [illegible] the system under catastrophic failures
is still controllable, how to effectively and [illegible] change of the system dynamics, [illegible]
accommodate the system failures, maintain the system on-line stability [illegible] by using only
the [illegible].

[illegible] problems for nonlinear [illegible] investigated. The problems are analyzed and
approached without any specific assumption [illegible] certain kind of failure scenarios
[illegible] structure. [illegible] of this paper is to develop an effective on-line failure
accommodation technique for general nonlinear dynamic systems with unanticipated component
failures in real-world applications. Based upon discrete-time Lyapunov stability theory, the
necessary and sufficient conditions to guarantee system on-line stability and performance under
catastrophic failures [illegible]. A general intelligent on-line fault tolerant control [illegible]
with the on-line fault detection and control law reconfiguration problems. This paper is organized
as follows. In Section 2, we define the on-line fault accommodation control problem of [illegible].
[illegible] problem is analyzed under the general formulation and the necessary and sufficient
conditions to maintain the system's on-line stability are derived through the theoretical analysis.
[illegible] intelligent on-line fault accommodation control strategy together with some [illegible]
proposed in Section 4. The simulation [illegible] Section 5 to [illegible] effectiveness of the
proposed on-line control framework in various [illegible] cases. Section 6 is dedicated to
demonstrating how the developed strategy can be implemented in commercial-off-the-shelf
hardware under the real-time environment. The conclusion is included in Section 7 along with
the discussion of [illegible].




2.  FTC  PROBLEM  OF  INTEREST 

A general $n$-input, $m$-output dynamic system can be described by Equation (1)

\[
\begin{aligned}
y_l(k+d) &= f_l(\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_n), \\
\bar{y}_i &= \{y_i(k+d-1),\; y_i(k+d-2),\; \ldots,\; y_i(k+d-p_i)\}, \\
\bar{u}_j &= \{u_j(k),\; u_j(k-1),\; \ldots,\; u_j(k-q_j)\},
\end{aligned}
\tag{1}
\]

$p_i, q_j \in \Re$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, and $l = 1, 2, \ldots, m$,

where $f_l : \Re^{P} \times \Re^{Q} \to \Re$, with $P = \sum_{i=1}^{m} p_i$ and $Q = \sum_{j=1}^{n} q_j$, is the
mathematical realization of the system dynamics for the $l$th output. $y_i, y_l, u_j \in \Re$ are the
$i$th and $l$th system outputs and the $j$th input, respectively. $d$ is the relative degree of the
system (the smallest delay from the input signal to the system output). In general, $f_l$ may not be
readily available in mathematical format all the time due to the difficulty of modeling a complex
dynamic system. However, it is possible to develop a realization to describe the system behavior
off-line with a known bounded uncertainty within the desired working region of the system using
all the existing modeling techniques. These techniques include state-of-the-art computational
intelligence technology such as artificial neural networks or fuzzy logic, provided enough
computing resource and sufficient time for the development of the realization [9-11], as shown in
Equation (2)

\[
y_l(k+d) = \hat{f}_l(\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_n) + \eta_l(\bar{y}, \bar{u}),
\tag{2}
\]

where $\|\eta_l(\bar{y}, \bar{u})\|_2 \le \delta_0$, $\forall (\bar{y}, \bar{u}) \in (Y, U)$, $(Y, U)$ represents the desired
working regime, and $\delta_0 \in \Re$ is a known constant. Thus, $\hat{f}_l$, the realization of the real
system with a known bounded uncertainty in the desired working region of the system, will be
either a mathematical, [illegible], or a combined realization, and it is assumed that this
realization is developed off-line.

Equation (1) denotes a healthy system under the fault-free situation and Equation (2) the
corresponding nominal model. Under different component failures, the system dynamics is
represented by the following equation

\[
y_l(k+d) = f_l(\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_n)
+ \sum_{v=1}^{r} \beta_l^{v}(k - T_l^{v})\, F_l^{v}(\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_m, \bar{u}_1, \bar{u}_2, \ldots, \bar{u}_n, k),
\tag{3}
\]

where $F_l^{v}$ represents the dynamic change (a general function that depends upon past system
outputs, past control inputs, and the current control input) caused by the unknown and possibly
unanticipated failure mode $v$ for the $l$th output. $\beta_l^{v}$, $F_l^{v}$, and $T_l^{v}$ are assumed
unknown due to the possible occurrence of unanticipated failures. $r$ is the number of system
failures. All the cases in which $r > 1$ are referred to as multiple failures. Two typical faults,
incipient faults and abrupt faults, are considered. Their characteristics can be described by the
time-varying gains: the incipient fault:
$\beta_l^{v}(k - T_l^{v}) = (1 - e^{-\alpha_l^{v}(k - T_l^{v})})\, U(k - T_l^{v})$; the abrupt fault:
$\beta_l^{v}(k - T_l^{v}) = U(k - T_l^{v})$. $\alpha_l^{v} \in \Re$ is an
unknown constant which defines the time profile of the incipient failure mode $v$ and $U(k)$
denotes  the  unit  step  function.  Abrupt  failures  are  used  to  represent  the  sudden  change  of  the 
system  dynamics  due  to  catastrophic  malfunction  or  failure  of  the  system  component  while  the 
incipient  failures  are  used  to  describe  the  time-varying  effect  of  the  system  component-aging 
behavior. 
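
The two fault profiles can be sketched as follows. This is only an illustration under stated assumptions: the function names, the failure time, and the rate constant are hypothetical, and the unit step is taken to be 1 at zero.

```python
import math

def unit_step(k):
    """Unit step U(k); the value at k = 0 is assumed to be 1."""
    return 1.0 if k >= 0 else 0.0

def abrupt_gain(k, t_fail):
    """Abrupt fault: the gain jumps from 0 to 1 at the failure time."""
    return unit_step(k - t_fail)

def incipient_gain(k, t_fail, alpha):
    """Incipient fault: the gain rises as 1 - exp(-alpha (k - T)) after the failure time."""
    if k < t_fail:
        return 0.0  # the unit step masks the gain before the failure occurs
    return 1.0 - math.exp(-alpha * (k - t_fail))

# Hypothetical failure at step 10 with rate constant 0.1.
for k in (9, 10, 20, 60):
    print(k, abrupt_gain(k, 10), round(incipient_gain(k, 10, 0.1), 4))
```

A larger rate constant makes the incipient profile approach the abrupt one, which is why the same gain structure can represent both sudden malfunction and gradual aging.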


The control objective is to generate appropriate control signals to stabilize the system and,
possibly, drive the system outputs back to the desired trajectories, $y_{d,l}(k+d) \in \Re$,
$l = 1, 2, \ldots, m$, in on-line situations in the presence of the abrupt/incipient faults and
modelling uncertainties. Of course, many system failure situations are catastrophic and
uncontrollable. For example, if the sensor loop malfunctions such that all the readings from the
sensor are lost or meaningless, without knowing the true failure, the only way to possibly maintain
the system safety is by human intervention such as shut-down of the system, failure diagnosis,
and replacement of the faulty parts. If the failures actually break down the control input to the
system, there is no way to perform any control recovery on-line. In on-line situations, the
interesting and important question becomes how to properly control the system behavior in time to
prevent the failure from causing more serious loss, if the system under failure is still
controllable at that time. Figure 1 shows the problem region of interest of this research, the
curve-line area, where the system behavior under failures is out of the nominal working area, but
still controllable and within the physically available working region. The major FTC objective is
to prevent the faulty system from moving into the saturation regime and possibly drive it back to
the nominal condition.


[Figure 1 labels: Controllability; Uncontrollable; saturation; Physically available working region; System status]

Figure 1. FTC problem region of interest



3.  On-line  Reconfigurable  Control 


3.1  Theoretical  foundation  and  analysis 

Without loss of generality, we let $d = 1$ and consider the SISO system to facilitate the
analysis and derivation. Define an $S$ function representing the desired dynamics as shown in
Equation (4)

\[
S(k) = \frac{[y_d(k) - y_d(k-1)] - [y(k) - y(k-1)]}{\Delta t} + \alpha\,(y_d(k) - y(k)),
\tag{4}
\]

where $y_d(k)$ and $y(k)$ represent the desired system output and the actual system output at time
step $k$, respectively. $\Delta t$ represents the sampling period. $\alpha$ is a positive real number
which determines how the system output will converge to the desired trajectory. The desired
dynamics can then be described by setting the $S$ function equal to zero (i.e., $S = 0$) and it is a
function of tracking error, $e(k) = y_d(k) - y(k)$. According to the discrete-time Lyapunov
stability theory, if we let $V(e(k)) = S^2(e(k))$ as the Lyapunov function candidate, the controller
design objective becomes satisfying $V(e(k+1)) \le V(e(k))$. (To simplify the notation, $e$ will be
eliminated from the remainder of this paper.) This implies $S^2(k+1) \le S^2(k)$, i.e.,
$[S(k+1) + S(k)][S(k+1) - S(k)] \le 0$, which is the same as satisfying the following inequalities,

\[
\begin{aligned}
-S(k) \le S(k+1) \le S(k) &\quad \text{when } S(k) > 0, \\
S(k) \le S(k+1) \le -S(k) &\quad \text{when } S(k) < 0.
\end{aligned}
\tag{5}
\]

For $S(k) > 0$, plugging in Equation (4), we have

\[
-S(k) \le \frac{y_d(k+1) - y_d(k)}{\Delta t} - \frac{y(k+1) - y(k)}{\Delta t} + \alpha\,(y_d(k+1) - y(k+1)) \le S(k).
\tag{6}
\]

Reorganizing the inequality, we get

\[
-S(k) - Y(k) \le \left(-\alpha - \frac{1}{\Delta t}\right) y(k+1) \le S(k) - Y(k),
\tag{7}
\]

where $Y(k) = \dfrac{y_d(k+1) - y_d(k) + y(k)}{\Delta t} + \alpha\, y_d(k+1)$. This can be further
simplified as

\[
(Y(k) + S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1} \ge y(k+1) \ge (Y(k) - S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1}.
\tag{8}
\]

For $S(k) < 0$, we have

\[
(Y(k) - S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1} \ge y(k+1) \ge (Y(k) + S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1}.
\tag{9}
\]

Notice that the left hand side and the right hand side of inequalities (8) and (9) [illegible]

\[
\begin{aligned}
y(k+1) &= Ny(k+1) + fy(k+1) \\
&= N\hat{y}(k+1) + N\tilde{y}(k+1) + nf\hat{y}(k+1) + nf\tilde{y}(k+1),
\end{aligned}
\tag{10}
\]

where $Ny(k+1)$, $fy(k+1)$, $N\hat{y}(k+1)$, $N\tilde{y}(k+1)$, $nf\hat{y}(k+1)$, and $nf\tilde{y}(k+1)$ denote
the nominal system, the failure dynamics, the nominal model, the remaining uncertainty between the
nominal system and the nominal model, the on-line estimator, and the remaining uncertainty between
the estimator and the failure dynamics, respectively (i.e., $Ny(k+1) = N\hat{y}(k+1) + N\tilde{y}(k+1)$ and
$fy(k+1) = nf\hat{y}(k+1) + nf\tilde{y}(k+1)$). The effective control input satisfying the inequalities
(8) or (9) can be found using optimization algorithms. Modern population-based optimization
techniques such as the genetic algorithm, evolutionary strategy, [illegible] have been exploited in
a variety of areas and applications [12-15]. However, although their effectiveness in achieving
successful optimization objectives has been demonstrated, most of them are applicable only in
off-line situations at present, due to the time-consuming iterative process. From the computational
complexity point of view, the well-known and efficient gradient descent algorithm will be
considered and used in the remainder of this paper because of its popularity in on-line
applications. The desired point at every time step is


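\[
\begin{aligned}
Desire(k) &= \left[(S(k) + Y(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1} + (Y(k) - S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1}\right] / 2 \\
&= Y(k)\left(\alpha + \frac{1}{\Delta t}\right)^{-1}.
\end{aligned}
\tag{11}
\]

Define the error as

\[
Error(k) = Desire(k) - N\hat{y}(k+1) - N\tilde{y}(k+1) - nf\hat{y}(k+1) - nf\tilde{y}(k+1).
\tag{12}
\]

The effective control input can be searched based upon the gradient descent algorithm for the
square error,

\[
\begin{aligned}
\frac{\partial Error^2(k)}{\partial u(k)} &= 2\,Error(k)\,\frac{\partial Error(k)}{\partial u(k)} \\
&= -2\,Error(k)\left[\frac{\partial N\hat{y}(k+1)}{\partial u(k)} + \frac{\partial N\tilde{y}(k+1)}{\partial u(k)}
+ \frac{\partial nf\hat{y}(k+1)}{\partial u(k)} + \frac{\partial nf\tilde{y}(k+1)}{\partial u(k)}\right].
\end{aligned}
\tag{13}
\]

The resulting control input will be updated by

\[
u(k)_{new} = u(k)_{old} - \alpha(k)\,\frac{\partial Error^2(k)}{\partial u(k)},
\tag{14}
\]

where $\alpha(k)$ is a time-varying learning rate parameter. The searching procedure is repeated until
[illegible] dynamics and $N\tilde{y}(k+1)$, the remaining uncertainty of the nominal system, [illegible]
terms $\dfrac{\partial N\tilde{y}(k+1)}{\partial u(k)}$ and $\dfrac{\partial nf\tilde{y}(k+1)}{\partial u(k)}$ cannot be
computed either. So, the actual searching procedure is based upon the approximated values:

\[
\widehat{Error}(k) = Desire(k) - N\hat{y}(k+1) - nf\hat{y}(k+1), \quad \text{and}
\tag{15}
\]

\[
\frac{\partial \widehat{Error}^2(k)}{\partial u(k)} = -2\,\widehat{Error}(k)\left[\frac{\partial N\hat{y}(k+1)}{\partial u(k)}
+ \frac{\partial nf\hat{y}(k+1)}{\partial u(k)}\right].
\tag{16}
\]
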

It can be shown that the $S$ function is bounded as shown in Equation (17). Please refer to
[illegible] finite $\sup\{|N\tilde{y}(k+1)|\}$, $\sup\{|nf\tilde{y}(k+1)|\}$, $\sup\{|\Delta Error(k)|\}$ (i.e.,
the modeling uncertainty, on-line estimation error, and the optimization error are finite after the
time step, $T_f$, at which time the control actions are reconfigured for proper failure
accommodation).

\[
\underline{\Sigma} \le S(k+1) \le \overline{\Sigma},
\tag{17}
\]

where

\[
\underline{\Sigma} = -\left[\sup\{|N\tilde{y}(k+1)|\} + \sup\{|nf\tilde{y}(k+1)|\} + \sup\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right), \quad \text{and}
\]

\[
\overline{\Sigma} = \left[\sup\{|N\tilde{y}(k+1)|\} + \sup\{|nf\tilde{y}(k+1)|\} + \sup\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right).
\]


The discrete-time Lyapunov stability theory indicates that the control problem can be solved
as long as the numerical value of the failure dynamics is realized at each time step, which is a
measurement of how far the failure drives the system dynamics away from the desired dynamics.
Based upon the above theoretical analysis, the system under unexpected catastrophic failures can
be stabilized on-line and the performance can be recovered provided an effective on-line
[illegible] such that the necessary and sufficient condition is satisfied. Furthermore, since the
on-line estimator is used to provide the approximated numerical value of the failure dynamics at
each time step based upon the most recent measurements (i.e., the failure may be time-varying), no
specific structure or dynamics is required for the estimator [illegible] the failure is needed for
the control purpose. At every time step, the desired point, $Y(k)(\alpha + \frac{1}{\Delta t})^{-1}$, is
computed, and the effective control signal is searched to ensure that the actual [illegible]
possible [illegible] system.
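
The search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the fixed learning rate, the finite-difference gradient, and the toy one-step predictor (standing in for the sum of the nominal model and the on-line estimator) are all hypothetical.

```python
def s_function(yd_k, yd_km1, y_k, y_km1, alpha, dt):
    """S function of Equation (4)."""
    return ((yd_k - yd_km1) - (y_k - y_km1)) / dt + alpha * (yd_k - y_k)

def desired_point(yd_kp1, yd_k, y_k, alpha, dt):
    """Desired output Y(k) (alpha + 1/dt)^-1 of Equation (11)."""
    big_y = (yd_kp1 - yd_k + y_k) / dt + alpha * yd_kp1
    return big_y / (alpha + 1.0 / dt)

def search_control(predict, desire, u0=0.0, rate=0.1, steps=200, eps=1e-6):
    """Gradient descent on the squared approximated error, as in Equations (14)-(16)."""
    u = u0
    for _ in range(steps):
        error = desire - predict(u)
        # Finite-difference stand-in for the model and estimator sensitivities to u(k).
        sensitivity = (predict(u + eps) - predict(u - eps)) / (2.0 * eps)
        u -= rate * (-2.0 * error * sensitivity)
    return u

# Hypothetical numbers: alpha = 2, dt = 0.1, current output 0.8, reference moving 1.0 -> 1.1.
alpha, dt, y_k, yd_k, yd_kp1 = 2.0, 0.1, 0.8, 1.0, 1.1

def predict(u):
    """Toy predictor: nominal model 0.5 y(k) + 2 u(k) plus an estimated failure term 0.3."""
    return 0.5 * y_k + 2.0 * u + 0.3

desire = desired_point(yd_kp1, yd_k, y_k, alpha, dt)
u = search_control(predict, desire)
s_next = s_function(yd_kp1, yd_k, predict(u), y_k, alpha, dt)
print(u, predict(u), desire, s_next)
```

If the predicted output reaches the desired point, the S function at the next step evaluates to zero, which is the middle of the admissible interval in inequalities (8) and (9).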

3.2  On-line  fault  tolerant  control  laws  for  a  class  of  nonlinear  systems 

If the nominal system behavior under the fault-free condition can be described by the so-called
[illegible] controllability canonical form (i.e., linear in control) and the fault is a
function of system outputs, the first on-line control law can be derived for real-time failure
accommodation as shown in Equation (18)

\[
u(k) = \left[Y(k)\left(\alpha + \frac{1}{\Delta t}\right)^{-1} - f(\cdot) - NF(k)\right]\frac{1}{g(\cdot)},
\tag{18}
\]

where the nominal model is in the form of $y(k+1) = f(\cdot) + g(\cdot)u(k)$, and $NF(k)$ denotes the
numerical value of the failure dynamics at time step $k$ realized through the on-line estimator.
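
A small simulation can illustrate how the control law of Equation (18) behaves on a plant that is linear in control. Everything below is a hypothetical setup chosen for illustration: the plant coefficients, the constant reference, the abrupt fault, and above all the assumption that the estimator returns the failure value exactly.

```python
def control_law(big_y, f_val, g_val, nf_val, alpha, dt):
    """Equation (18): u(k) = [Y(k) (alpha + 1/dt)^-1 - f(.) - NF(k)] / g(.)."""
    return (big_y / (alpha + 1.0 / dt) - f_val - nf_val) / g_val

def simulate(steps=100, alpha=2.0, dt=0.1, y_ref=1.0, fault_step=5, fault_size=0.3):
    """Track a constant reference through an abrupt additive fault."""
    y = 0.0
    for k in range(steps):
        fault = fault_size if k >= fault_step else 0.0    # abrupt failure dynamics
        big_y = (y_ref - y_ref + y) / dt + alpha * y_ref  # Y(k) for a constant reference
        # NF(k) is assumed to be estimated exactly; a real estimator would only approximate it.
        u = control_law(big_y, 0.5 * y, 2.0, fault, alpha, dt)
        y = 0.5 * y + 2.0 * u + fault                     # hypothetical plant with the fault
    return y

final_output = simulate()
print(final_output)
```

With an exact failure estimate the output obeys y(k+1) = Y(k)(alpha + 1/dt)^-1 at every step, so the tracking error shrinks geometrically regardless of when the fault occurs.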

Sharing the similar spirit of the Discrete-time Sliding Mode Control technique [16], the
alternative corrective control law that possesses more robustness for the control
problems of our interest can be developed as follows:

\[
u(k) = u_1(k) + u_2(k),
\tag{19a}
\]

[Equation (19b), which defines $u_2(k)$, is illegible apart from the fragment $-NF(\cdot)/g(\cdot)$.]
(19b)

where $u_1(k)$ represents the nominal control law and $T_c$ denotes the specific time step at which
the difference of the sum square approximation errors of the on-line estimator during two
consecutive windows, $Q$, is below a pre-specified threshold, $\delta$, for the first time. This
implies that the on-line learning result cannot be further improved significantly (i.e.,
$Q < \delta$). $\varphi(k)$ denotes the boundary layer thickness. The saturation function is defined
to be

\[
\mathrm{sat}\!\left(\frac{S(k)}{\varphi(k)}\right) =
\begin{cases}
+1, & \text{if } S(k) > \varphi(k) \\
S(k)/\varphi(k), & \text{if } |S(k)| \le \varphi(k) \\
-1, & \text{if } S(k) < -\varphi(k)
\end{cases}
\]

The boundary thickness and the controller gain are defined as

\[
\varphi(k) = \eta(k) + \varepsilon,
\tag{20}
\]

\[
K(k) = \eta(k) + 2\varepsilon,
\tag{21}
\]


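respectively, where $\varepsilon$ is an arbitrary positive constant and $\eta(k)$ will be updated
using the following equation

\[
\eta_{new}(k) =
\begin{cases}
\sup\{|D(k)\,\Delta f(\cdot)|\} = \sup\left\{\left|D(k)\left[\sum_{v=1}^{r} \beta^{v}(\cdot)\,F^{v}(\cdot) - NF(\cdot)\right]\right|\right\}, & \text{if } Q < \delta \\
\eta_{old}, & \text{otherwise}
\end{cases}
\tag{22}
\]

and

[Equation (23), which defines $D(k)$, is illegible apart from the fragment $S(k) - S(k-1)$.]
(23)
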


The boundary layer thickness is now redefined by the least upper bound of the failure
uncertainty, the on-line approximation error, as shown in Equations (20) and (22), by tracking
$\sum_{v} \beta^{v}(\cdot)F^{v}(\cdot)$ (i.e., multiple failures) on-line. Once a fault is detected, the
control signal is adjusted by adding the corrective control signal, $u_2(k)$, to confine the system
performance within a boundary layer using the first term of the right hand side in Equation (19).
At the same time, the on-line estimator is initialized and learns (approximates) the unknown
failure mode dynamics on-line. The on-line learning result is monitored and evaluated by using the
following criterion:


\[
SSAE0 = \sum_{k=k_0}^{k_0+l-1} \left(fy(k) - nf\hat{y}(k)\right)^2
\tag{24a}
\]

\[
SSAE1 = \sum_{k=k_0+l}^{k_0+2l-1} \left(fy(k) - nf\hat{y}(k)\right)^2
\tag{24b}
\]

\[
Q = |SSAE1 - SSAE0|
\tag{24c}
\]

where SSAE0 and SSAE1 stand for the sum square approximation errors of the on-line estimator
during two consecutive windows. $nf\hat{y}(k)$ and $fy(k)$ are the output of the estimator and the
difference between the measurement and the output of the nominal model at time step $k$,
respectively. The design parameter $l$ represents a time period such that the least upper bound of
the approximation error is evaluated every time period, $L = [k-l, k]$. Equations (20)-(22) state
that both the boundary layer thickness and the controller gain are automatically estimated and
adjusted on-line by the estimator to further reduce the control error. What should be mentioned
here is that by adding $U(k - T_c)$ in the second term of the right hand side in Equation (19b) to
delay the compensation of the nominal control law until the convergence of the learning result, it
is assumed that the system under the nominal control law will not lose stability before time step
$T_c$. This delay usually results in less sensitivity in the selection of the learning rate in the
learning process and a better transient performance. The detailed analysis, discussion, and
simulation tests of both on-line control laws have been reported in [17-18].
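
The convergence test of Equation (24) can be sketched as follows. The function names, the window length, the threshold, and the sample sequences are hypothetical; the logic simply compares the sum square approximation errors over two consecutive windows.

```python
def ssae(fy, nf_hat, start, length):
    """Sum square approximation error over the window [start, start + length - 1]."""
    return sum((fy[k] - nf_hat[k]) ** 2 for k in range(start, start + length))

def learning_converged(fy, nf_hat, k0, length, threshold):
    """Return (Q < threshold, Q) with Q = |SSAE1 - SSAE0| over two consecutive windows."""
    ssae0 = ssae(fy, nf_hat, k0, length)
    ssae1 = ssae(fy, nf_hat, k0 + length, length)
    q = abs(ssae1 - ssae0)
    return q < threshold, q

# Hypothetical data: a constant failure value of 0.3 and an estimator that closes in on it.
fy = [0.3] * 8
nf_hat = [0.0, 0.1, 0.2, 0.25, 0.29, 0.3, 0.3, 0.3]

early = learning_converged(fy, nf_hat, k0=0, length=2, threshold=1e-3)
late = learning_converged(fy, nf_hat, k0=4, length=2, threshold=1e-3)
print(early, late)
```

In this example the early windows still differ noticeably, so the test fails, while the later windows differ by about 1e-4 and the test passes; the first step at which it passes plays the role of the time step denoted Tc in the text.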

4. On-line Fault Tolerant Control Strategy

Figure 2 shows the basic framework of the on-line control strategy for the system that may be
[illegible] unanticipated catastrophic failures. The intelligent control regulator monitors and
evaluates [illegible] behavior at every time instant [illegible] a fault detection mechanism.
During the nominal operation mode that corresponds to the non-failure situation, the nominal
control [illegible] reconfigured and computed by the regulator based upon the current knowledge of
the failure dynamics provided by the on-line estimator. The intelligent control regulator also has
an interface with the supervisor to accept higher priority commands, such as change of the control
objective or design parameters, and to warn the supervisor for emergency shut-down of the system
in cases where the unanticipated system failures are catastrophic and the system is actually
uncontrollable after the failures. A successful fault tolerance mission relies heavily upon the
[illegible] include the accuracies of the nominal model, [illegible] approximator and fault
detection scheme. The details are discussed as follows.

4.1 Design of the nominal model

The main design objective of the nominal model is to satisfy Equation (2) within the desired region under failures. It is the possibility that the nominal model can be developed and tested off-line that makes this goal practically feasible. For complex dynamic systems whose mathematical description of the nominal model is not readily available, the so-called intelligent modelling techniques have shown effectiveness in the complicated modeling process, provided sufficient resources and time are available for satisfactory development. These techniques include polynomials, rational functions, spline functions, multilayer perceptron networks, radial-basis-function networks, adaptive fuzzy systems, and HME (Hierarchical Mixture of Expert Networks) [10-11].


Figure  2  Basic  framework  of  on-line  fault  tolerant  control 


4.2 Design of the nominal controller

The  second  step  is  to  design  the  nominal  controller  based  upon  the  nominal  model.  For  the 
model  whose  mathematical  representation  is  not  in  controllability  canonical  form  or  the  model  is 
realized  through  the  intelligent  modelling  techniques,  the  closed-form  control  law  may  not  be 
easily  formulated.  In  these  cases,  neural  controllers  have  been  introduced  for  the  control  purpose 
[19-20].  However,  despite  the  fact  that  the  effectiveness  of  the  neural  controller  has  been  shown, 
the  design  or  training  process  of  the  controller  is  still  quite  complicated  since  the  training  of  the 
neural  controller  involves  a  dynamic  system.  Representative  approaches  include  the  so-called 
dynamic  backpropagation  [19],  backpropagation  through  time  [29-31],  finite  difference 
approximation, and SPSA (Simultaneous Perturbation Stochastic Approximation) [21]. A more computationally efficient design method for the neural controller can be constructed using an approach similar to the one discussed in Section 3. The details of the design procedure used in the on-line simulation section are described as follows:

a.  Set  up  initial  conditions  (i.e.,  the  S  function  in  Equation  (4),  the  initial  system  states,  the 
initial  control  input,  etc.). 

b. Compute the desired point at each time step (i.e., Equation (11)) and define the error (i.e.,
Equation (12) without the failure terms).

c. Compute the Jacobian of the nominal model with respect to the control inputs.

d. Search the effective control inputs to satisfy inequality (8) or (9) using Equations (13)-(16)
without the failure terms.

e. Collect the training patterns for the neural controller (i.e., the system outputs and the
effective control inputs that satisfy the inequalities).

f. Go back to step b until the time ends.

g.  Train  the  neural  controller  with  the  selected  network  structure,  the  collected  training 
patterns,  and  static  training  algorithm  [22]. 

This approach breaks the complex training process of the neural controller into a gradient descent optimization and a static training method, which appears to be effective and requires much less computational cost.
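The steps a to g can be sketched in Python under strong simplifying assumptions. The scalar `nominal_model`, the error tolerance that stands in for inequality (8) or (9), the sinusoidal desired trajectory, and the linear least squares fit that stands in for the static neural network training of [22] are all hypothetical choices made for illustration; they are not the systems or algorithms used in this report.

```python
import numpy as np

def nominal_model(y, u):
    # Hypothetical scalar nominal model y(k+1) = 0.9 y(k) + 0.5 tanh(u(k)).
    return 0.9 * y + 0.5 * np.tanh(u)

def jacobian_u(y, u, eps=1e-6):
    # Step c: central-difference Jacobian of the model w.r.t. the control input.
    return (nominal_model(y, u + eps) - nominal_model(y, u - eps)) / (2.0 * eps)

def search_control(y, y_des, u0, rate=4.0, tol=1e-8, max_iter=200):
    # Step d: gradient descent on 0.5 * (y_des - y_next)^2 with respect to u.
    u = u0
    for _ in range(max_iter):
        err = y_des - nominal_model(y, u)
        if abs(err) < tol:  # assumed stand-in for inequality (8) or (9)
            break
        u += rate * err * jacobian_u(y, u)
    return u

def collect_patterns(steps=50, y0=0.0):
    # Steps a, b, e, f: simulate, search a control at each step, store the pattern.
    y, u = y0, 0.0
    inputs, targets = [], []
    for k in range(steps):
        y_des = 0.3 + 0.1 * np.sin(0.2 * k)  # hypothetical desired point
        u = search_control(y, y_des, u)
        inputs.append([y, y_des, 1.0])
        targets.append(u)
        y = nominal_model(y, u)
    return np.array(inputs), np.array(targets)

X, U = collect_patterns()
# Step g: static fit on the collected patterns (least squares as a stand-in).
weights, *_ = np.linalg.lstsq(X, U, rcond=None)
```

The point of the sketch is the separation of concerns: the control search only needs the model Jacobian at each step, and the controller itself is fitted afterwards on a fixed data set, so no training through the dynamic system is required.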

4.3  On-line  fault  detection  scheme 

The  next  step  is  to  test  the  performance  of  the  nominal  controller.  Equation  (25)  shows  a 
testing  criterion  that  evaluates  the  mean  square  control  error  within  a  fixed  length  of  window 
with  cd  as  a  design  parameter. 

$$V = \frac{1}{c_d} \sum_{k = k_0}^{k_0 + c_d - 1} \left( y_d(k) - y(k) \right)^2, \qquad (25)$$

where $k_0$ is the starting time step. The testing result, which indicates the system behavior under the nominal controller in the fault-free situation, not only can be used to evaluate the design of the nominal controller, but also provides useful information for the on-line fault detection. It is known that complete on-line fault detection and diagnosis is still impossible at present due to the inherent complexity of the problems and the time constraint in on-line situations [6]. An ideal fault detection scheme should be sensitive enough to detect any fault that deteriorates system performance and robust enough to reject false alarms that are possibly caused by modeling error, disturbance, and measurement noises. Unfortunately, these two goals usually conflict with each other and the selection of a proper fault detection algorithm becomes a trade-off decision. Since the system safety under failures is the first priority of control missions, a false alarm is always preferable to a missed detection. In other words, a conservative fault detection scheme that guarantees freedom from missed detections is more likely to be adopted. From both the system safety and the on-line computational complexity points of view, Equation (25) together with a pre-specified threshold value, X, is a possible candidate for the on-line fault detection method. By choosing the threshold value carefully based upon the modeling uncertainty and possible measurement noises, we will have an effective on-line fault detection mechanism. Under this conservative fault detection method, missed detection becomes trivial: the control objective is to keep the tracking error as small as possible within an affordable control effort, so if the fault cannot be seen in the tracking error, or it lasts only a short transient period such that the failure alarm is not triggered, the fault is not within our concern (i.e., its effect on the system does not degrade the control performance). Of course, the price of the trivial missed detection and computational simplicity is the increased possibility of false alarms caused by unexpected interferences or noises.
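The detection rule described above can be illustrated with a short Python sketch. The function names, the sample data and the threshold value are assumptions made for illustration; only the windowed mean square criterion and the comparison against a pre-specified threshold come from the text.

```python
import numpy as np

def windowed_mse(y_des, y, k0, c_d):
    # Mean square control error over time steps k0 .. k0 + c_d - 1 (Equation (25)).
    y_des = np.asarray(y_des, dtype=float)
    y = np.asarray(y, dtype=float)
    err = y_des[k0:k0 + c_d] - y[k0:k0 + c_d]
    return float(np.mean(err ** 2))

def fault_alarm(y_des, y, k0, c_d, threshold):
    # Conservative rule: raise the alarm whenever the criterion exceeds the threshold.
    return windowed_mse(y_des, y, k0, c_d) > threshold

# Hypothetical data: the output tracks 0.2 until a fault at step 12 lowers it by 0.05.
y_des = np.full(20, 0.2)
y = y_des.copy()
y[12:] -= 0.05

before = fault_alarm(y_des, y, k0=0, c_d=5, threshold=1e-4)
after = fault_alarm(y_des, y, k0=12, c_d=5, threshold=1e-4)
```

A lower threshold makes the alarm more sensitive at the cost of more false alarms, which is the trade-off discussed in this section.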

4.4 On-line estimation of the failure dynamics

Successful on-line fault accommodation also requires a good on-line estimator to realize the numerical value of the unanticipated failure dynamics. Since Artificial Neural Networks have been proven able to approximate any piecewise continuous function given sufficient neurons in the hidden layer [23], neural networks are used as the on-line estimator for the unknown failure dynamics in this research work. Some
important  features  of  on-line  learning  using  neural  networks  should  first  be  addressed  here.  The 
structure  of  the  on-line  estimator  needs  to  be  decided  (i.e.,  in  neural  networks,  the  number  of 
hidden  layers,  number  of  neurons  in  each  hidden  layer,  and  neuron  transfer  functions  have  to  be 
specified).  It  is  known  that  neural  networks  are  sensitive  to  the  number  of  neurons  in  the  hidden 
layers.  Too  few  neurons  can  result  in  underfitting  problems  (poor  approximation),  while  too 
many  neurons  may  contribute  to  overfitting  problems,  where  all  the  training  patterns  are  well  fit, 
but  the  fitting  curve  may  take  wild  oscillations  between  the  training  data  points  [25].  The 
criterion  for  stopping  the  training  process  is  another  important  issue  in  real  applications.  If  the 
mean  square  error  of  the  estimator  is  forced  to  reach  a  very  small  value,  the  estimator  may 
perform  poorly  for  the  new  input  data  slightly  away  from  the  training  patterns.  This  is  the  well- 
known  generalization  problem.  Besides,  in  real  applications,  the  training  patterns  may  contain 
noises  since  they  are  the  measurements  from  real  sensors.  The  estimator  may  adjust  itself  to  fit 
the noise instead of the real failure dynamics. Methods such as the early stopping criterion and generalization network training algorithms may be useful to improve these situations [24-25].

Under the on-line environment, the number of input-output data used for training becomes a very important design parameter. It is unreasonable and impractical to use all the measurements for training, since the system dynamics may keep changing (i.e., the failure may be time-varying) and the estimator may be misled by invalid old data. A reasonable way is to use the most recent input-output measurements. A set, B, that contains the most recent measurements within a fixed length of a time-shifting data window is used to collect the training patterns,


$$B = \left\{ [p(m), t(m)] \;\middle|\; p \in \mathbb{R}^{S};\ t \in \mathbb{R}^{T};\ k - j + 1 \le m \le k \right\}, \qquad (26)$$

where $p(m)$ and $t(m)$ are the network input vector and desired output vector at time step $m$, respectively, $k$ is the current time step and $j$ represents the length of the time-shifting data window, which is a design parameter. This parameter has to be decided based upon the system
computational  capability,  sampling  rate,  and  the  performance  requirement.  In  addition,  the 
maximum  number  of  the  effective  control  signal  searching  iterations  is  another  important  design 
parameter  in  real-time  applications.  It  has  to  be  within  an  allowable  range  according  to  the 
system  computational  capacity  in  the  on-line  situation.  For  gradient  descent  type  of  optimization 
algorithms,  a  time-varying  learning  rate  can  be  used  to  possibly  reduce  the  searching  time. 
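A time-shifting data window of this kind can be sketched with a fixed-length queue. The class name and the sample patterns below are hypothetical; the only behavior taken from Equation (26) is that the set always holds the $j$ most recent input/target pairs.

```python
from collections import deque

class TrainingWindow:
    """Time-shifting data window B of Equation (26): keeps the j most recent patterns."""

    def __init__(self, j):
        # With maxlen set, the oldest pattern is dropped automatically.
        self.patterns = deque(maxlen=j)

    def add(self, p, t):
        # p: network input vector, t: desired output vector at the current time step.
        self.patterns.append((tuple(p), tuple(t)))

    def training_set(self):
        inputs = [p for p, _ in self.patterns]
        targets = [t for _, t in self.patterns]
        return inputs, targets

window = TrainingWindow(j=3)
for m in range(1, 6):                   # time steps m = 1..5
    window.add([float(m)], [2.0 * m])   # hypothetical input/target pair
inputs, targets = window.training_set()
```

After five time steps with $j = 3$, only the patterns from steps 3 to 5 remain, so data recorded before a change in the failure dynamics are forgotten after $j$ steps.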

5.  SIMULATION  STUDY 

5.1 A benchmark problem: three-tank system

A well-regarded fault diagnosis benchmark, the three-tank system shown in Figure 3 [26], is used to demonstrate how to implement the proposed FTC technique in a real application. The state equations of the system are given as

$$\dot{x}_1 = \left( -c_1 S_p\,\mathrm{sign}(x_1 - x_3)\sqrt{2g|x_1 - x_3|} + u_1 \right)/A + \eta_1(x,u), \qquad (27a)$$

$$\dot{x}_2 = \left( -c_3 S_p\,\mathrm{sign}(x_2 - x_3)\sqrt{2g|x_2 - x_3|} - c_2 S_p\sqrt{2g x_2} + u_2 \right)/A + \eta_2(x,u), \qquad (27b)$$

$$\dot{x}_3 = \left( c_1 S_p\,\mathrm{sign}(x_1 - x_3)\sqrt{2g|x_1 - x_3|} - c_3 S_p\,\mathrm{sign}(x_3 - x_2)\sqrt{2g|x_3 - x_2|} \right)/A + \eta_3(x,u). \qquad (27c)$$


Figure 3 A benchmark problem (three-tank system; labels: outflow rate Q20, leakage (Fault 1), leakage (Fault 2))

The three tanks are identical and have a cylindrical shape with cross section $A = 0.0154$ m$^2$. The cross section of the connection pipes is $S_p = 5 \times 10^{-5}$ m$^2$ and the liquid levels in the three tanks are denoted by $x_1$, $x_2$, and $x_3$, respectively, with $0 < x_i < 0.69$ m, $\forall i = 1, 2, 3$. The control inputs, $u_1$ and $u_2$, are the flow rates coming from pumps 1 and 2 to tanks 1 and 2, respectively. $q_{20} = c_2 S_p \sqrt{2 g x_2}$ is the outflow rate from tank 2. $c_1 = 1$, $c_2 = 0.8$, and $c_3 = 1$ denote the non-dimensional outflow coefficients, $g$ is the gravity acceleration, and $\eta_i$, $i = 1, 2, 3$, represent the corresponding modeling uncertainty due to the inaccuracy on the cross section of the connection pipes. The discrete-time model is derived by using the forward Euler approximation

$$\dot{x}_i \approx \frac{x_i(k+1) - x_i(k)}{\Delta t}, \quad i = 1, 2, 3, \qquad (28)$$


where $\Delta t = 0.1$ second represents the sampling period. Substituting Equation (28) and re-arranging the state equations, we have the nominal model in the controllability canonical form. The initial condition is set to the liquid levels $x_1(0) = x_2(0) = x_3(0) = 0.15$ m and the control objective is to keep the liquid levels at 0.2 m (i.e., $x_{1d}(k) = x_{2d}(k) = x_{3d}(k) = 0.2$, $\forall k > 0$). The modeling uncertainty is assumed to satisfy:


$$|\eta_i(x,u)| \le \bar{\eta}_i, \quad \forall (x,u) \in \Omega, \quad i = 1, 2, 3, \qquad (29)$$

where $\Omega$ represents the region of interest. In order to simulate the effects of modeling uncertainty and possibly noises in the measurements, uniformly distributed random values satisfying Equation (29) with $\bar{\eta}_1$ = 3.5x10-, $\bar{\eta}_2$ = 2.05x10-, and $\bar{\eta}_3$ = 6.5x10- are added to the corresponding state equations.
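A minimal Python sketch of one forward Euler step of the three-tank model is given below, using the geometry and coefficients stated above. The value of $g$, the helper names, and the `eta_bound` argument (the uniform uncertainty bounds, set to zero by default) are assumptions made for illustration.

```python
import math
import random

A = 0.0154                   # tank cross section, m^2
SP = 5e-5                    # connection pipe cross section, m^2
C1, C2, C3 = 1.0, 0.8, 1.0   # non-dimensional outflow coefficients
G = 9.81                     # gravity acceleration, m/s^2 (assumed value)
DT = 0.1                     # sampling period, s

def pipe_flow(c, h_from, h_to):
    # Signed flow c * Sp * sign(d) * sqrt(2 g |d|) with d = h_from - h_to.
    d = h_from - h_to
    return c * SP * math.copysign(math.sqrt(2.0 * G * abs(d)), d)

def euler_step(x, u, eta_bound=(0.0, 0.0, 0.0), rng=random):
    # One forward Euler step (Equation (28)) of the state equations (27a)-(27c).
    x1, x2, x3 = x
    u1, u2 = u
    q13 = pipe_flow(C1, x1, x3)                        # tank 1 -> tank 3
    q32 = pipe_flow(C3, x3, x2)                        # tank 3 -> tank 2
    q20 = C2 * SP * math.sqrt(2.0 * G * max(x2, 0.0))  # outflow of tank 2
    eta = [rng.uniform(-b, b) for b in eta_bound]      # bounded uncertainty
    dx1 = (-q13 + u1) / A + eta[0]
    dx2 = (q32 - q20 + u2) / A + eta[1]
    dx3 = (q13 - q32) / A + eta[2]
    return (x1 + DT * dx1, x2 + DT * dx2, x3 + DT * dx3)

# From equal levels and with the pumps off, only tank 2 loses liquid in one step.
x_next = euler_step((0.15, 0.15, 0.15), (0.0, 0.0))
```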

Design  of  nominal  controllers 

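The design of the nominal controller is based upon Equation (18) without the on-line estimator, and only the liquid levels $x_1(k)$ and $x_2(k)$ need to be considered in the controller design process, since $x_3(k)$ will eventually reach the same level as long as we can keep $x_1(k) = x_2(k) = 0.2$ m, due to the U-tube principle. Thus, the nominal controllers are

$$u_1(k) = \left[ Y_1(k)\left(a + \frac{1}{\Delta t}\right)^{-1} - f_1(\cdot) \right] / g_1(\cdot), \qquad (30a)$$

$$u_2(k) = \left[ Y_2(k)\left(a + \frac{1}{\Delta t}\right)^{-1} - f_2(\cdot) \right] / g_2(\cdot), \qquad (30b)$$

where

$$Y_1(k) = \frac{x_{1d}(k+1) - x_{1d}(k) + x_1(k)}{\Delta t} + a\,x_{1d}(k+1),$$

$$Y_2(k) = \frac{x_{2d}(k+1) - x_{2d}(k) + x_2(k)}{\Delta t} + a\,x_{2d}(k+1).$$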

$f_1(\cdot)$, $f_2(\cdot)$, $g_1(\cdot)$, and $g_2(\cdot)$ are the corresponding terms obtained when we re-organized the nominal model into the controllability canonical form. The two S functions, $S_1$ and $S_2$, are defined for $x_1$ and $x_2$, respectively, with the same form as Equation (4) and $a = 10$. The sum of the mean square errors (i.e., $e_1^2(k) = [x_{1d}(k) - x_1(k)]^2$ and $e_2^2(k) = [x_{2d}(k) - x_2(k)]^2$) within a fixed length (i.e., 5 time steps) of time-shifting window is selected as a criterion to test the performance of the nominal controllers in the presence of modeling uncertainty and possible noises. Based upon the testing results, the fault detection threshold value is then selected as
2.0  x  1 0  m  steady  state  condition. 
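The nominal control law of Equation (30) can be checked numerically with a small Python sketch. It assumes a model in the form $x(k+1) = f + g\,u(k)$, and the numerical values of `f_val` and `g_val` below are hypothetical stand-ins for the canonical-form terms of tank 1 (no inter-tank flow, $g = \Delta t / A$).

```python
def desired_term(xd_next, xd_now, x_now, a, dt):
    # Y(k) = [xd(k+1) - xd(k) + x(k)] / dt + a * xd(k+1)
    return (xd_next - xd_now + x_now) / dt + a * xd_next

def nominal_control(xd_next, xd_now, x_now, f_val, g_val, a, dt):
    # u(k) = [Y(k) * (a + 1/dt)^-1 - f] / g for a model x(k+1) = f + g * u(k)
    y_term = desired_term(xd_next, xd_now, x_now, a, dt)
    return (y_term / (a + 1.0 / dt) - f_val) / g_val

a, dt = 10.0, 0.1
x_now, x_des = 0.15, 0.2     # current level and constant desired level, m
f_val = 0.15                 # hypothetical f(.) with no inter-tank flow
g_val = dt / 0.0154          # hypothetical g(.) = dt / A
u = nominal_control(x_des, x_des, x_now, f_val, g_val, a, dt)
x_next = f_val + g_val * u   # level predicted by the nominal model
```

With these values the predicted level moves from 0.15 m to 0.175 m, so the tracking error shrinks by the factor $1/(1 + a\,\Delta t) = 0.5$ in one step, which shows how $a$ sets the convergence rate.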

Multiple failures: leakages in the tanks

The failures considered are an abrupt leakage in tank 1 and an incipient leakage in tank 2, whose failure dynamics are

F>  (*>  =  l/2W  (*) ,  A,  (*  -  T, )  =  U(k  -  T, ),  T,  =  270,  (31a) 

(*) ■  -C^ ^ W. A (* • - T2 )  =  (1  - e-«"»Wk - T, ), a,  =  0.063, T,  =426,  (31b) 

where  r,  =7.3x10"’  and  r,  =8.4x10-’.  No  information  in  Equation  (31)  is  assumed  to  be 

known except that the state variables $x_1(k)$ and $x_2(k)$ are measurable. The physical knowledge of the system provides us useful information to determine the initial upper bound of the failure dynamics. Since the failures are possible leakage problems in the tanks, the worst effect caused by a failure is a sudden draining of the liquid in the tank, which corresponds to the worst failure condition where the tank is completely broken. Thus, the initial upper bounds for the failures can be chosen as the liquid levels in the tanks at the corresponding time step. Two separate multi-layer perceptron (MLP) networks with a single input neuron, five neurons in the first hidden layer, five neurons in the second hidden layer, and a single output neuron (1-5-5-1) are used to serve as the on-line failure estimators for $F_1$ and $F_2$, respectively, with the static backpropagation method as the training algorithm. The selection of the MLP network structure is a design parameter and may not be optimal in this case. Generally speaking, a more complicated structure may be required for a more complex function to achieve a better performance, and the computational cost is expected to increase with the complexity. The on-line approximation result is monitored by the criterion shown in Equation (24) with $l = 10$ and the least upper bound of the failure uncertainty is computed according to Equation (22). Figure 4 shows the liquid levels in the tanks under the nominal control law alone. As the first leakage occurs in tank 1, liquid level 1 drops quickly, causing the liquid level in tank 3 to drop. As the second leakage occurs in tank 2, the liquid levels eventually drop below the initial condition. Applying the proposed control technique with the corrective control law (Equation (19)), we observe significant performance improvement by proper reconfiguration of the control inputs, the flow rates from pumps 1 and 2, as shown in Figure 5. The implementation of the first control law (Equation (18)) for this problem is straightforward and the result is shown in Figure 6 (i.e., with the same MLP estimators and the delay of compensation).

Figure 6 System on-line response with the first control law

5.2 On-line fault accommodation for multiple time-varying failures

In  order  to  demonstrate  how  to  use  the  proposed  FTC  strategy  for  the  general  nonlinear 
system  under  the  general  failure  scenario,  a  MIMO  system  under  multiple  time-varying  failures 
is  considered  as  shown  in  Equation  (32), 


*  (k  + 1)  =  l  +  y2(k)2  +  U>  (k)2  +  “2  (k)2  " 16Wl  (k) ” 2°“2 (k)  +  Af' 

y 2  (* + 1)  = Y+y^y  +  {k)Ul  {k) + 20u' (k)  ~  5“2  (k) + (k)> 

k-  25 

A/*  W  =  Ao(^-^io) x  0.1  x  —  y,  (k)cos(ux ( k ))  +  fiu(k  —  Tn)x  0.6y,  (&)y2  (£), 
A/2W  =  Ao(^-Ao)x0.1xy1  (k)y2  (£), 


where $\beta_{10}(k - T_{10}) = U(k - T_{10})$, $\beta_{11}(k - T_{11}) = U(k - T_{11})$, $\beta_{20}(k - T_{20}) = U(k - T_{20})$, $T_{10} = 25$, $T_{20} = 15$, and $T_{11} = 123$. The nominal system is first realized through a 4-75-2 MLP neural network. 2,000 input-output training patterns are collected by supplying uniformly distributed random inputs varying from -1.5 to 1.5. A 3-4-2 MLP network is chosen as the on-line estimator and the Levenberg-Marquardt algorithm with Bayesian regularization [24-25] is used in the training process for both the NN nominal model and the NN on-line estimator (i.e., a 4-40-2 MLP network is used as a nominal controller trained off-line by following the design procedure shown in Section 4.2). The desired trajectories for outputs $y_1$ and $y_2$ are generated by Equation
(33)

Figure 7 System response, y1 (solid line), vs. desired output, y1d (dashed line)
(MIMO system; 25-data window; consecutive abrupt faults)

Figure 8 System response, y2 (solid line), vs. desired output, y2d (dashed line)
(MIMO system; 25-data window; consecutive abrupt faults)

The testing criterion for the nominal controller is similar to Equation (25) with the sum of the mean square error for both outputs, and the length of evaluation is selected as 5. Based upon the testing result, the threshold value of the on-line fault detection scheme is chosen as $9 \times 10^{-5}$ under noise-free situations. No information pertaining to the failures, $\Delta f_1$ and $\Delta f_2$, is assumed known. Two different S functions are defined for $y_1$ and $y_2$ in the same form as Equation (4) with $a = 10$. A simple mean value of the two estimated gradient directions realized through the NN on-line estimator is used for the searching of the effective control inputs. The length of the time-shifting data window, $j$ in Equation (26), is selected as 25. The on-line simulation speed can reach 2-3 time steps per second on a workstation with dual Intel Pentium II 450 processors. Figures 7 and 8 show the system responses for the first and the second outputs together with the desired outputs, respectively, while the nominal controller alone fails to maintain the system stability under multiple failures.

6.  Real-Time  experiment 

To obtain a comprehensive insight for quantification of the design parameters and the real-time control system, an on-line fault tolerant control test bed has been constructed to validate the proposed on-line fault tolerant control strategy in real hardware. The hardware setup is shown in Figure 9. It consists of the following major components:

1. a BALDOR dc motor with maximum 1/2 hp,

2. a MAGTROL HD-505-8N dynamometer,

3. a MAGTROL 6200 dynamometer controller/readout,

4. one ADVANCED dc motor amplifier,

5. dSPACE software, DS1102 board and cable box with Texas Instruments
TMS320C31 floating-point Digital Signal Processor (DSP), and

6. NT workstation with Intel Pentium II-450 dual processors.


Figure  9  Hardware  experiment  setup 


The dc motor is connected to the dynamometer, which is used to generate unanticipated friction on the motor shaft to simulate the unanticipated system failures. The control objective is to keep the rotational speed of the motor (i.e., in terms of rpm) on the desired patterns in the presence of the unanticipated simulated failures. A computer with Intel Pentium II-450 dual processors is used to simulate the intelligent control regulator, fault detection mechanism, and on-line estimator. An embedded encoder and sensor in the dynamometer provide motor rpm and torque measurements in real-time, respectively. The measured signals are connected to a MAGTROL 6200 dynamometer controller/readout with the on-line readings shown on the device screen, and the same signals are sent to a dSPACE DS1102 cable box that is connected to the workstation through the TMS320C31 DSP board. An adjustable brake dial on the front panel of the dynamometer controller is used to generate the simulated time-varying, unknown and unanticipated braking (i.e., unanticipated workload on the dc motor). All the necessary computation, including the appropriate control input to drive the motor, is performed within the workstation. The DSP board and dSPACE software are used to provide the necessary interface (A/D and D/A converter) and the integration of the real-time control with high-level languages including MATLAB, SIMULINK, and C programs. A picture of the real-time fault tolerant control test bed is shown in Figure 10.


In the real-time environment, to close the on-line control loop as shown in Figure 9, an application source code (i.e., obj file) has to be created and downloaded to the TMS320C31 DSP. The on-line fault detection scheme, failure estimation, and control algorithm are performed under the Matlab workspace in the NT workstation, which communicates with the DSP through dSPACE MLIB (Matlab-dSPACE Interface Library). Figure 11 shows the SIMULINK model that is used to create the application source code for the real-time experiment. One 14-bit D/A converter channel and one 16-bit A/D converter channel are used to generate the control input (i.e., motor input voltage) and collect the torque reading from the dynamometer controller/readout, respectively. A discrete filter is used to reduce the effect of measurement noises in the torque reading. The rotational speed reading is decoded through one DS1102 encoder interface channel with a 24-bit counter. The DSP with the generated application code runs the hardware experiments in real-time with a sampling period of 0.01 second, and the control signal generated by the computer is sent to regulate the real-time response by changing the constant value in the SIMULINK model.

Figure 11 The SIMULINK model for the real-time experiment

Following the design procedure shown in Section 4, the first step is to obtain a nominal model for the fault-free system. It is well known that a dc motor can be modeled as a linear time-invariant system. For an armature-controlled dc motor with a negligible armature time constant, the nominal transfer function can be represented by Equation (34) [27],

$$G(s) = \frac{w(s)}{V(s)} = \frac{K_m}{R_a (J s + f) + K_b K_m}, \qquad (34)$$

where $w(s)$ and $V(s)$ denote the rotational speed and motor voltage in the $s$ domain, respectively. $K_b$, $K_m$, $R_a$, $J$, and $f$ are motor constants. Equation (34) can be re-organized as Equation (35) with $A$, $b$, $c$ representing the corresponding constants,

$$A\dot{w} + b w = c V. \qquad (35)$$

Using the forward Euler approximation shown in Equation (28), the discrete-time nominal model can be derived and shown in Equation (36),

$$w(k+1) = [1 - b\Delta t / A]\,w(k) + [c\Delta t / A]\,V(k), \qquad (36)$$


which is in the controllability canonical form. The next step is to identify the parameters, $f_{linear} = [1 - b\Delta t / A]$ and $g_{linear} = [c\Delta t / A]$. Since Equation (36) is a linear time-invariant system, the batch form least square estimation method can be used for the identification of the parameters [28]. With the zero initial condition, 20,000 sets of input signals generated by Equation (37) are sent to the system for the collection of the system responses, $w(k)$,

$$V(k) = 0.015 \times \sin\left(\frac{k\pi}{\ldots}\right) + 0.015, \quad k = 1, 2, \ldots, 19{,}999. \qquad (37)$$


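The batch form least square method provides the parameter estimation as follows [28],

$$Z = \begin{bmatrix} w(20{,}000) \\ w(19{,}999) \\ \vdots \\ w(2) \end{bmatrix}, \quad H = \begin{bmatrix} w(19{,}999) & V(19{,}999) \\ w(19{,}998) & V(19{,}998) \\ \vdots & \vdots \\ w(1) & V(1) \end{bmatrix}, \quad \theta = \begin{bmatrix} f_{linear} \\ g_{linear} \end{bmatrix},$$

$$Z = H\theta + v, \quad \text{and} \quad \hat{\theta}_{LS} = \begin{bmatrix} \hat{f}_{linear} \\ \hat{g}_{linear} \end{bmatrix} = [H^T H]^{-1} H^T Z, \qquad (38)$$

where $v$ and $\hat{\theta}_{LS}$ represent the white noise and the least square estimation, respectively. The design of the nominal controller follows Equation (18) without the term of the fault estimator, as shown in Equation (39),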

$$Y(k) = \frac{w_{desired}(k+1) - w_{desired}(k) + w(k)}{\Delta t} + a\,w_{desired}(k+1) \quad \text{and}$$

$$V_{nominal}(k) = \left[ Y(k)\left(a + \frac{1}{\Delta t}\right)^{-1} - f_{linear} \times w(k) \right] / g_{linear}, \qquad (39)$$

with $a = 1$ and the S function defined in the form of Equation (4).
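The batch least square identification of Equations (36)-(38) can be sketched in Python with synthetic data. The true parameter values, the noise level, the number of samples, and the period of the excitation signal are assumptions made for illustration; `numpy.linalg.lstsq` is used here instead of forming $[H^T H]^{-1} H^T Z$ explicitly, which gives the same estimate with better numerical behavior.

```python
import numpy as np

def identify(w, v):
    # Rows of H are [w(k), V(k)]; the matching entries of Z are w(k+1).
    H = np.column_stack((w[:-1], v))
    Z = w[1:]
    theta, *_ = np.linalg.lstsq(H, Z, rcond=None)
    return theta  # [f_linear, g_linear]

# Synthetic experiment with hypothetical true parameters.
rng = np.random.default_rng(0)
f_true, g_true = 0.95, 40.0
n = 2000
k = np.arange(n)
v = 0.015 * np.sin(k * np.pi / 100.0) + 0.015   # excitation; period assumed
w = np.zeros(n + 1)                              # zero initial condition
for i in range(n):
    w[i + 1] = f_true * w[i] + g_true * v[i] + rng.normal(0.0, 0.01)

f_hat, g_hat = identify(w, v)
```

The estimates `f_hat` and `g_hat` can then be substituted for $f_{linear}$ and $g_{linear}$ in Equation (39).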

The final step in the off-line design stage is to evaluate the nominal model accuracy and the performance of the nominal controller under the fault-free environment for proper selection of the design parameters in the on-line fault detection scheme. The length of the time-shifting evaluation window for the fault detection scheme (i.e., Equation (25) with the square operation replaced by the absolute value) is pre-selected as 5 and the system response under the nominal controller is tested using this criterion in the presence of measurement noises. The on-line fault detection threshold value is set to 100 based upon the testing results. Under the unanticipated failures, the system response can be approximated by Equation (40),

$$w(k+1) = f_{linear}\,w(k) + g_{linear}\,V(k) + F(Torque(k)), \qquad (40)$$

where $F$ denotes the unknown effect that changes the motor rotational speed due to the unanticipated workload, $Torque(k)$. To reduce the negative effect of noisy measurements, the approximation target (i.e., the numerical value of $F$) is computed based upon the average of the differences between the nominal model outputs and the actual speed readings every 10 time steps. A 1-5-5-1 MLP network is used to approximate the unknown failure effect, $F$, on-line with the static backpropagation algorithm. Two real-time experiments with different desired trajectories and unanticipated faults are presented here to test the proposed fault accommodation technique. Each real-time experiment is completed within 5,000 time steps. The design parameters of the learning result criterion (i.e., Equations (19)-(24)) for the alternative corrective control law are $l = 20$ and $\delta = 10$.

6.1 Experiment 1


The control objective in this experiment is to maintain a constant rotational speed of 500 rpm. The unknown and unanticipated failure is generated by adjusting the brake dial on the front panel of the dynamometer controller/readout on-line. The actual time-varying workload on the motor is unknown, with a torque reading of 18-30% shown on the front panel of the dynamometer controller/readout while the brake is adjusted by hand during the real-time experiment. Figure 12 is the real-time system behavior plot under the nominal controller. Due to the increased workload, the motor rotational speed almost reaches zero from time step 2500 to 3850. On the other hand, a successful fault tolerant mission has been accomplished through the proposed on-line fault accommodation technique, as shown in Figures 13 and 14.

Figure 15 On-line system behavior under nominal controller only (experiment 2)

Figure 16 On-line system behavior under the first control law (experiment 2)

6.2 Experiment 2


The desired trajectory in this experiment is selected as a sinusoid curve generated by a linear model with a reference input as shown in Equation (41),

$$ref(k) = 60 \times \sin\left(\frac{k\pi}{1000}\right) + 60,$$

$$w_{desired}(k+1) = 0.6\,w_{desired}(k) + 0.2\,w_{desired}(k-1) + ref(k). \qquad (41)$$

The unknown workload used to generate the simulated unanticipated faults ranges from 15% to
28%. The system behavior under the failures with the nominal controller alone is plotted in
Figure 15. As clearly shown, the performance is significantly degraded and the rotation
actually stops during two periods, from time step 1300 to 1700 and from 3200 to 4000, due to the
relatively large unanticipated workload. Figures 16 and 17 show the satisfactory real-time fault
accommodation when the first and the alternative corrective control techniques are applied,
respectively.
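
The reference model of Equation (41) can be simulated in a few lines. The following is only a sketch under stated assumptions: the numerator of the sine argument is hard to read in the source, so it is exposed as the hypothetical parameter `angle_num` (defaulting to pi), and zero initial conditions are assumed for the desired speed. The function name `desired_trajectory` is illustrative as well.

```python
import math


def desired_trajectory(steps, angle_num=math.pi):
    """Generate ref(k) and w_desired(k) with the structure of Equation (41).

    angle_num is an assumed stand-in for the numerator of the sine argument;
    only the divisor 1000 is clearly legible in the source.
    """
    ref = [60.0 * math.sin(angle_num * k / 1000.0) + 60.0 for k in range(steps)]
    # Assumed zero initial conditions for w_desired(0) and w_desired(1).
    w = [0.0, 0.0]
    for k in range(1, steps - 1):
        # w_desired(k+1) = 0.6 w_desired(k) + 0.2 w_desired(k-1) + ref(k)
        w.append(0.6 * w[k] + 0.2 * w[k - 1] + ref[k])
    return ref, w


ref, w_desired = desired_trajectory(4000)
```

Since the linear model has a steady-state gain of 1 / (1 - 0.6 - 0.2) = 5, the desired trajectory is a scaled and slightly delayed copy of the reference sinusoid.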

Figure 17 On-line system behavior under the alternative corrective control law (experiment 2)

7.  Conclusion 

In this paper, the on-line fault tolerant control problems under unanticipated system failures
are investigated from a realistic point of view. Through the discrete-time Lyapunov stability
theory, the necessary and sufficient conditions to guarantee the system on-line stability and
performance under failures are derived with no specific assumption about the system dynamic
structure or failure scenarios. An on-line fault accommodation control strategy that contains a
detailed off-line design procedure, an efficient on-line fault detection scheme, and an effective
control law reconfiguration technique is presented for the FTC problems of interest. Because of its
capabilities of self-optimization and on-line adaptation, an artificial neural network is used in this
research work as the on-line estimator for the unknown failure dynamics.

The effective control inputs to accommodate system failures are automatically computed on-line
by the control regulator through the realization of the NN estimator, based upon only partially
available information about the failure dynamics. The price paid for this successful control
mission is a certain degree of computational expense. Under the suggested on-line fault detection
scheme, missed detections of failures become negligible while the possibility of false alarms
increases. Due to space limitations, the simulation tests under noisy measurements and false
alarms are omitted. However, if any prior information concerning the statistical properties of the
noise is available, it is recommended that a noise reduction or cancellation process be used for
better system performance in real applications, since contaminated noisy measurements will
mislead both the interpretation of the system behavior and the on-line estimator. In general, the
effectiveness of the developed on-line fault accommodation control technique for catastrophic
system failures has been validated through on-line simulation tests. Successful on-line fault
tolerance in real applications has also been demonstrated through real-time hardware experiments
in the presence of measurement noise and unknown unanticipated faults. The on-line simulation
speed can reach 2-3 time steps per second on dual Intel Pentium II 450 processors. Real-time
experimental results also indicate that a more powerful computing device, such as a computer with
higher speed dual processors, is mandatory for on-line real-time fault tolerant control in real
applications under the more general formulations. Although the dual processors currently used may
not be fast enough for many real-time control systems that require a higher sampling rate, it is
believed that, with the continuous performance improvement of microprocessors and semiconductor
technology, the developed on-line fault accommodation technique can be implemented on-line in
most real-time control systems in the near future.

Current research work focuses on extending the on-line fault tolerant control technique
to conflicting failure situations in MIMO cases, where the accommodation of some failures
may require a certain degree of compromise in other objectives. Under these situations, the
control problems are both theoretically and technically complicated, since the reconfiguration of
the control actions becomes a multi-objective optimization in the sense of Pareto optimality.

APPENDIX


Assumptions:

1. The nominal model, $\hat{N}y(k+1)$, is accurate and precise enough such that $\tilde{N}y(k+1)$, the
remaining uncertainty of the nominal system, is bounded by $\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\}$, where $T_f$ is
the time step at which the control input starts accommodating the failures. (Note that this
constraint can possibly be relaxed since the nominal model can be obtained off-line using all
existing modeling techniques.)

2. The remaining uncertainty of the failure dynamics, $\tilde{n}f_y(k+1)$, is the residue resulting from
the difference between the actual $f_y(k+1)$ and the best estimate of the on-line estimator,
and it is bounded by the least upper bound, $\sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\}$.

3. The error caused by the optimization algorithm is bounded by $\sup_{\forall k>T_f}\{|\Delta Error(k)|\}$.

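Proof:

Let $\Delta Error(k)$ represent the error after the searching effort of the optimization algorithm. Then,

$$ \Delta Error(k) = Desire(k) - \hat{N}y(k+1) - \hat{n}f_y(k+1). $$

By Equation (11),

$$ Y(k)\left(\alpha + \frac{1}{\Delta t}\right)^{-1} - \Delta Error(k) = \hat{N}y(k+1) + \hat{n}f_y(k+1). \tag{42} $$

For $S(k) > 0$:

Plugging Equations (42) and (10) into the inequality (8), we have

$$ (Y(k) + S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1} > Y(k)\left(\alpha + \frac{1}{\Delta t}\right)^{-1} - \Delta Error(k) + \tilde{N}y(k+1) + \tilde{n}f_y(k+1) > (Y(k) - S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1}. $$

Simplifying the inequality, we get

$$ S(k) > \left[\tilde{N}y(k+1) + \tilde{n}f_y(k+1) - \Delta Error(k)\right]\left(\alpha + \frac{1}{\Delta t}\right), \text{ and} $$

$$ -S(k) < \left[\tilde{N}y(k+1) + \tilde{n}f_y(k+1) - \Delta Error(k)\right]\left(\alpha + \frac{1}{\Delta t}\right). \tag{43} $$

Since $S(k) > 0$,

$$ -S(k) < \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) $$

is always true. By assumptions 1, 2, and 3, the following inequality will hold for the worst
condition:

$$ S(k) > \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right). \tag{44} $$

Apparently,

$$ \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) = \inf_{\forall k>T_f}\{S(k)\}, $$

which is the greatest lower bound of $S(k)$, and

$$ -\left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) = \sup_{\forall k>T_f}\{-S(k)\}, $$

which is the least upper bound of $-S(k)$. Since $S(k) > S(k+1)$ and $-S(k) < -S(k+1)$, the following
inequalities will always hold

$$ \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) > S(k+1), \text{ and} $$

$$ -\left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) < -S(k+1), \tag{45} $$

for both situations, $S(k+1) > 0$ and $S(k+1) < 0$, which implies

$$ \underline{\Xi} < S(k+1) < \overline{\Xi}. \tag{46} $$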

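For $S(k) < 0$:

Plugging Equations (42) and (10) into the inequality (9), we have

$$ (Y(k) - S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1} > Y(k)\left(\alpha + \frac{1}{\Delta t}\right)^{-1} - \Delta Error(k) + \tilde{N}y(k+1) + \tilde{n}f_y(k+1) > (Y(k) + S(k))\left(\alpha + \frac{1}{\Delta t}\right)^{-1}. $$

Simplifying the inequality, we get

$$ -S(k) > \left[\tilde{N}y(k+1) + \tilde{n}f_y(k+1) - \Delta Error(k)\right]\left(\alpha + \frac{1}{\Delta t}\right), \text{ and} $$

$$ S(k) < \left[\tilde{N}y(k+1) + \tilde{n}f_y(k+1) - \Delta Error(k)\right]\left(\alpha + \frac{1}{\Delta t}\right). \tag{47} $$

Since $S(k) < 0$,

$$ S(k) < \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) $$

is always true. By assumptions 1, 2, and 3, the following inequality will hold for the worst
condition:

$$ -S(k) > \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right). \tag{48} $$

Apparently,

$$ \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) = \inf_{\forall k>T_f}\{-S(k)\}, $$

which is the greatest lower bound of $-S(k)$, and

$$ -\left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) = \sup_{\forall k>T_f}\{S(k)\}, $$

which is the least upper bound of $S(k)$. Since $-S(k) > -S(k+1)$ and $S(k) < S(k+1)$, the following
inequalities will always hold

$$ \left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) > -S(k+1), \text{ and} $$

$$ -\left[\sup_{\forall k>T_f}\{|\tilde{N}y(k+1)|\} + \sup_{\forall k>T_f}\{|\tilde{n}f_y(k+1)|\} + \sup_{\forall k>T_f}\{|\Delta Error(k)|\}\right]\left(\alpha + \frac{1}{\Delta t}\right) < S(k+1), \tag{49} $$

for both situations, $S(k+1) > 0$ and $S(k+1) < 0$, which implies

$$ \underline{\Xi} < S(k+1) < \overline{\Xi}. \tag{50} $$

We get exactly the same result as Equation (46). Thus, the S function is bounded by the value
defined by the least upper bounds of the remaining uncertainty of the nominal system, the
residue of the on-line estimator, and the error of the optimization algorithm.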


Q.E.D. 




APPENDIX  G: 


Multiple  Model  Approach  by  Orthonormal  Bases 
For  Controller  Design 

by 

Gary  G.  Yen  and  Seok-Beom  Lee 
Submitted  to 

International  Journal  of  Control 


MULTIPLE  MODEL  APPROACH  BY  ORTHONORMAL  BASES 
FOR  CONTROLLER  DESIGN* 


Gary G. Yen  Seok-Beom Lee

Oklahoma  State  University 
School  of  Electrical  and  Computer  Engineering 
Intelligent  Systems  and  Control  Laboratory 
Stillwater,  OK  74078 


Abstract 


Recently, model-free or data-driven control has been gaining great interest as a way to overcome
the limitations of conventional model based control methodologies. However, existing data-driven
control is far from practical because of its slow convergence, severe computational
burden, and lack of analysis and synthesis tools. This paper is motivated by these
deficiencies of data-driven control methodologies. A multiple modeling method for subsequent
robust controller design is proposed. This method is developed to overcome the difficulties of
utilizing the existing multiple model approaches for control purposes by adopting a different
locality concept based on dominant poles. The dominant poles are represented by Laguerre basis
functions. Two examples are included to demonstrate the feasibility and characteristics of the
proposed identification approach.


* This work was supported in part by the U.S. Air Force Office of Scientific Research under Grant
F49620-98-1-0049 and National Science Foundation, Measurement and Control Engineering Center.


1.  Introduction 


When we are given a complex problem to solve, one of our instinctive approaches is to
divide the problem into smaller and more manageable ones. In this spirit, multiple model control
[Murray-Smith and Johansen, 1997] is an intuitive approach to solving complex control problems
because it originates from the same philosophy, the so-called “divide and conquer” strategy.
From the control point of view, modeling is a very important procedure that has to be performed
before designing controllers. In spite of the importance of modeling, there is no universal
modeling method; therefore, it is not surprising that control engineers have to spend a good deal
of time just to obtain a proper model for controller design. Another limitation of model based
control is that it can only handle the environments considered beforehand.
There are many cases in which a model cannot incorporate all possible scenarios, such as
unanticipated failure modes and structural autonomy. Adaptive control may be the solution for
these cases; however, most model based adaptive controllers can only handle a certain degree of
uncertainty [Åström and Wittenmark, 1995].

To relieve the difficulties in modeling, linear system identification methods are available;
however, linear models can represent a system only in small operating regimes. Neural networks
have been suggested as a universal framework for nonlinear system identification, but the
structures of neural networks are different from the ones that have been studied for nonlinear
controller design. Also, training a neural network for an entire system under various operating
conditions is not trivial, if not impossible. Bearing these problems in mind, multiple model
control has been suggested as an alternative for handling a complex nonlinear system by
decomposing it into more tractable subsystems and then integrating them together.

Even  though  the  multiple  model  control  mentioned  above  is  concerned  with  the  case  where  a 
global  analytical  model  is  not  available,  multiple  model  control  can  be  also  beneficial  when  a 
global  nonlinear  model  is  available.  The  reason  is  that  nonlinear  control  design  is  still  an 
evolving  subject  and  the  structures  fit  for  nonlinear  controller  design  are  very  restricted  [Haber 
and  Unbehauen,  1990].  In  many  cases,  complex  nonlinear  systems  can  reduce  to  more  tractable 
systems  in  local  regions;  hence  local  controller  design  for  the  local  environment  is  relatively 
simpler  than  a  global  nonlinear  controller  design.  Another  motivation  for  multiple  model  control 
is  that  it  can  handle  rapidly  changing  system  dynamics  or  operating  environments.  Adaptive  and 
robust  control  have  also  been  suggested  for  these  purposes  and  are  under  extensive  investigation. 
However,  switching,  tuning,  or  interpolating  action  of  multiple  model  control  is  expected  to 
provide  faster  and  more  reliable  control  than  other  control  methods  [Narendra  et  al.,  1995]. 

Because the idea of multiple model control comes naturally, it is not surprising that multiple
model control shares, to some extent, common ground with gain scheduling [Apkarian and Adams, 1998]
and hybrid systems [Antsaklis, 1998]. Also, similar approaches can be found in
statistics and intelligent systems under different names such as local nonparametric methods and
lazy learning [Atkeson et al., 1997a-b]. In this paper, we use multiple
model control as a general term to embrace all the similar methods.

As mentioned previously, multiple model control is beneficial whether a theoretical
model is available or not. However, the multiple model control approach with available models
is usually problem dependent and is the subject of nonlinear control theory [Kaplan and
Glass, 1995]. Therefore, this study is concerned with cases without a proper theoretical
model and is limited to the empirical modeling process and control. Even though the ideas and
expected benefits of multiple model control are natural and attractive, there are many technical
details that have to be resolved to make multiple model control practical. The following is a brief
list of the issues of empirical multiple model control that have to be resolved:

1. defining locality (how to distribute or partition a system);

2.  identification  of  subsystems; 

3.  switching,  interpolation,  and  extrapolation  strategy; 

4.  local  controller  design  and  validation; 

5.  realizing  global  stability  and  performance; 

6.  on-line  adaptation  of  models  and  controllers;  and 

7.  implementation  issues. 

The first three items concern modeling. Validation of local models is necessary to
verify the local models and estimate local uncertainties for robustness of the subsystems. The
next three items concern analysis and synthesis of multiple model based control.
Implementation issues such as computational complexity and the effect of quantization error
must be addressed to realize practical multiple model control.

As a first stage in achieving practical multiple model control, this paper emphasizes the
modeling part. The main concern is empirical modeling from observed data by forming
subsystems and coordinating them to achieve a good global model with an estimated quality
description. Most multiple model techniques lack rigorous analysis of model quality. For robust
controller design, which is inevitable for data-driven control because of model error caused by
noisy and biased data, a model quality description with limited complexity of the nominal model is
essential. In Section 2, a multiple modeling algorithm based on orthonormal bases and
uncertainty estimation is developed to overcome the problems of the existing multiple model
approaches to controller design. To demonstrate the use and characteristics of existing methods
and the new development, a simulation study is included in Section 3. The simulation study shows
promising results for multiple modeling approaches. In Section 4, conclusions are drawn, with
relevant observations and future research directions.

2.  Multiple  Model  Approach  by  Orthonormal  Bases 

Multiple modeling may be seen as a branch of nonlinear system identification. Compared
to global methods such as neural networks, multiple modeling divides a system into subsystems.
One driving force of decomposition is so-called temporal crosstalk [Jordan and Jacobs, 1994],
which makes once-trained global models forget what was learned previously. Also, the
identification of subsystems is more efficient and transparent than with global methods. The
heart of multiple modeling is to divide the system, identify local models, and then combine them.
Approaches to multiple modeling can be roughly divided into two categories: probabilistic and
non-probabilistic. Non-probabilistic approaches [Narendra, 1996], [Narendra and
Mukhopadhyay, 1997], [Principe et al., 1998] are based on the prediction error or the geometric
information of data, while probabilistic approaches [Anderson, 1985], [Kadirkamanathan and
Fabri, 1998] are based on the assumed probability of the system.

As seen from extensive literature reviews, several multiple modeling methods have been
proposed as alternatives for modeling complex nonlinear systems. However, a short glimpse of the
proposed methods soon reveals that they lack a theme, that is, the purpose of the models. Since
a model is merely a suitable description of physical phenomena,
models without a designated purpose can hardly be useful in practice. In this study, the focus is on
developing models for designing robust controllers. Putting this purpose into the modeling
process imposes certain constraints on our model: limited complexity and an explicit representation
of the uncertainty of the model. Because we are interested in data-driven modeling or system
identification, the uncertainty representation should also depend on estimation instead of derivation
from physical interpretations. Because of several benefits of orthonormal bases, orthonormal
basis functions are commonly used in system identification for robust control [Bodin et al., 1997].
A review of representative orthonormal basis functions in the literature is given first. Motivated by
the reviewed uncertainty estimation techniques and orthonormal basis functions, a new multiple
modeling method is then proposed.

2.1  Orthonormal  bases 

As seen in most nonlinear system identification frameworks, a regression vector usually
consists of a series of past inputs and outputs. However, this may cause propagation of
estimation error and sometimes causes instability of the models [Nelles, 1998]. For this reason,
regressors that depend only on the input are attractive. The problem with input-only regressors
such as the FIR model is that they require very high order models to achieve a satisfactory result.
Filtered inputs, instead of pure inputs, can therefore be used to reduce the complexity of the
models. Recently, orthonormal basis functions have attracted great interest. By using
orthonormal basis functions, parameter estimation is treated as a well-known linear least squares
approximation. This accelerates the estimation of parameters, avoids convergence to a local
minimum, and facilitates the analysis of the model properties. In [Wahlberg, 1994], an
orthonormal basis function is derived as the solution of an optimization problem in the sense of
the n-width measure. The n-width measures the smallest approximation error for the worst possible
system using the best possible n-dimensional linear-in-the-parameters model set. The n-width
measure is formally defined as [Tøffner-Clausen, 1996]:

Definition 1 (n-width measure)

Assume that we know that the system G belongs to a given bounded set S, $G \in S$; then the n-width
measure is defined as:

$$ d_n(S;B) = \inf_{\Phi_n \in M_n(B)}\ \sup_{G \in S}\ \inf_{G_n \in \Phi_n} \|G - G_n\|_B \tag{1} $$

where B denotes some Banach space with norm $\|\cdot\|_B$, e.g., $H_2$ or $H_\infty$, $M_n(B)$ denotes the
collection of all n-dimensional linear subspaces of B, $\{\Psi_j\}_{j=1}^{n}$ spans $\Phi_n$, and
$G_n = \sum_{j=1}^{n} \theta_j \Psi_j$. S is an a priori given bounded set containing G. The innermost term
$e_{\Phi_n}(G) = \inf_{G_n \in \Phi_n} \|G - G_n\|_B$ denotes the best achievable model error for G by elements
in $\Phi_n$. Let $\Phi = \{\Phi_n\}_{n=1}^{\infty}$ denote the corresponding sequence of subspaces. $\Phi$ is called
complete if, for any $G \in B$, $e_{\Phi_n}(G) \to 0$ as $n \to \infty$. If
$d_n(S;B) = \sup_{G \in S} \inf_{G_n \in \Phi_n^*} \|G - G_n\|_B$, then $\Phi_n^*$ is called an optimal subspace for
$d_n(S;B)$.


For exponentially stable systems (analytic in $|z| > R$), FIR models are proven to be optimal in
the n-width sense for $H_2$ and $H_\infty$ [Pinkus, 1985]. However, if information about the
dominating poles is available, then the Laguerre basis function is proven to be optimal for
systems $G(s) \in H_2(\mathbb{C}, \mathbb{C})$ that are analytic outside the domain $|s + \alpha| < r$, $\alpha > r$ [Wahlberg,
1994]. Here, in $H_2(\cdot,\cdot)$, the first argument is the domain of $G(\cdot)$ and the second is the range. This
corresponds to a situation where we know that the dominating poles of the system belong to the set
$|s + \alpha| < r$. Also, it is proven that the space spanned by the Kautz functions, $k = 1, \ldots, n$, with
$b = \sqrt{a^2 - r^2}$, is an optimal $2n$-dimensional subspace in the n-width sense for functions
$G(s) \in H_2(\mathbb{C}, \mathbb{C})$ that are analytic outside the domain $|(s^2 + as + c)/2| < r$, $r < a$ [Wahlberg,
1994]. Hence, the Laguerre basis function is often used to approximate well-damped systems, while
the Kautz basis function is used for approximating lightly damped systems. Therefore, the
combination of Laguerre and Kautz basis functions can approximate any linear system very
well. The following theorem for continuous-time Laguerre functions is cited from [Wahlberg,
1994].


Theorem 1 (n-width of continuous-time Laguerre models)

The space $\Phi_n^*$ spanned by the Laguerre functions

$$ L_j(s,a) = \frac{\sqrt{2a}}{s+a}\left(\frac{s-a}{s+a}\right)^{j-1}, \quad j = 1, \ldots, n, \tag{2} $$

with Laguerre pole $a = \sqrt{\alpha^2 - r^2}$ is an optimal n-dimensional subspace in the n-width sense for
functions $G(s) \in H_2(\mathbb{C}, \mathbb{C})$ that are analytic outside the domain $|s + \alpha| < r$, $\alpha > r$.


In addition, there is a corresponding theorem for discrete-time Laguerre functions
[Tøffner-Clausen, 1996].

Theorem 2 (n-width of discrete-time Laguerre models)

The space $\Phi_n^*$ spanned by the discrete-time Laguerre basis functions

$$ L_j(z,a) = \frac{\sqrt{1-a^2}}{z-a}\left(\frac{1-az}{z-a}\right)^{j-1}, \quad j = 1, \ldots, n, \quad |a| < 1, \tag{3} $$

with Laguerre pole

$$ a = \frac{1}{2\alpha}\left(1 + \alpha^2 - r^2 + \sqrt{(1 + \alpha^2 - r^2)^2 - 4\alpha^2}\right) \tag{4} $$

is an optimal n-dimensional subspace in the n-width sense for functions $G(z) \in H_2(\mathbb{C}, \mathbb{C})$ that
are analytic outside of the domain $|z - \alpha| < r$ with $r > |\alpha| + 1$.


The realization of Laguerre basis functions is represented as the expansion
$y(t) = \sum_{j=1}^{n} \theta_j L_j(z,a) u(t)$, which can be considered as inputs filtered through a low-pass
filter followed by a sequence of all-pass filters (as shown in Figure 1). The problem with using
these orthonormal basis functions is that they require prior knowledge of the system poles. A
crude estimate can be obtained from the step or impulse response of the system or from a
first-order ARX model. Better pole estimates can be obtained by numerical search [Malti et al.,
1998]. In [Oliveira e Silva, 1995], Taylor series are used to find the optimal poles by utilizing a
certain property of Laguerre filters at optimal poles. A study of more general orthonormal filters
can be found in [Bodin et al., 1997].
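
The low-pass plus all-pass realization can be sketched as a pair of difference equations. The code below is an illustration rather than the implementation used in this study: the function name `laguerre_filter_bank` is hypothetical, zero initial conditions are assumed, and the recursions are obtained by rewriting the discrete-time Laguerre functions in powers of $z^{-1}$.

```python
import math


def laguerre_filter_bank(u, a, n):
    """Filter the input sequence u through n discrete-time Laguerre functions.

    x[0] is u passed through the low-pass sqrt(1 - a^2) / (z - a); each later
    x[i] passes x[i-1] through the all-pass section (1 - a z) / (z - a).
    """
    if not abs(a) < 1.0:
        raise ValueError("the Laguerre pole must satisfy |a| < 1")
    gain = math.sqrt(1.0 - a * a)
    steps = len(u)
    x = [[0.0] * steps for _ in range(n)]
    for t in range(1, steps):
        # low-pass: x1(t) = a x1(t-1) + sqrt(1 - a^2) u(t-1)
        x[0][t] = a * x[0][t - 1] + gain * u[t - 1]
    for i in range(1, n):
        for t in range(1, steps):
            # all-pass: xi(t) = a xi(t-1) + x(i-1)(t-1) - a x(i-1)(t)
            x[i][t] = a * x[i][t - 1] + x[i - 1][t - 1] - a * x[i - 1][t]
    return x


# Impulse responses of the first four Laguerre functions for an example pole.
impulse = [1.0] + [0.0] * 299
responses = laguerre_filter_bank(impulse, 0.5, 4)
```

Feeding an impulse through the bank returns the Laguerre impulse responses, whose inner products over a long enough horizon approximate the identity matrix; this is a quick way to check the orthonormality that the least squares formulation relies on.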




Figure  1:  Laguerre  network. 

(LP  denotes  a  first  order  low-pass  while  AP  denotes  a  first  order  all-pass  filter.) 

2.2  Nonlinear  system  identification  with  Laguerre  bases 

As pointed out earlier, the existing multiple modeling methods lack quality descriptions. As a
matter of fact, deciding the interpolating or switching variables in the multiple models is a
relatively straightforward task once the local models are identified. For example, for a
single-input single-output system, the multiple model may be written as follows:

$$ \hat{y} = \sum_{i=1}^{m} w_i \hat{y}_i \tag{5} $$

where $\hat{y}_i$ and $w_i$ are the local model output and the weight for the $i$th model, respectively. As
can easily be seen, the model is linear in the weights, so they can be updated by recursive least
squares or a Kalman filter. Therefore, the more demanding task in multiple modeling is the
identification of the local models.
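
Because the combined output is linear in the weights, a standard recursive least squares (RLS) step is enough to update them. The sketch below is illustrative only: the function name `rls_update`, the forgetting factor `lam`, the initial covariance, and the synthetic local model outputs are all assumptions made for the example.

```python
import math


def rls_update(w, P, y_local, y, lam=1.0):
    """One RLS step for the weights w in y ~ sum_i w[i] * y_local[i].

    P is the (symmetric) covariance matrix and lam the forgetting factor.
    """
    m = len(w)
    Px = [sum(P[i][j] * y_local[j] for j in range(m)) for i in range(m)]
    denom = lam + sum(y_local[i] * Px[i] for i in range(m))
    gain = [v / denom for v in Px]
    err = y - sum(w[i] * y_local[i] for i in range(m))
    w_new = [w[i] + gain[i] * err for i in range(m)]
    # P is symmetric, so the row vector y_local' P equals Px transposed.
    P_new = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(m)] for i in range(m)]
    return w_new, P_new


# Synthetic check: two local model outputs combined with weights 0.3 and 0.7.
weights = [0.0, 0.0]
cov = [[1e4, 0.0], [0.0, 1e4]]
for t in range(200):
    local_outputs = [math.sin(0.1 * t), math.cos(0.3 * t)]
    target = 0.3 * local_outputs[0] + 0.7 * local_outputs[1]
    weights, cov = rls_update(weights, cov, local_outputs, target)
```

With noise-free data the estimated weights settle close to the values used to generate the target, which illustrates why the weight update is the easy part of the problem compared with identifying the local models.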

Intuitively, it is beneficial to have a constructive method to define the locality of the models.
The literature review of orthonormal functions such as Laguerre basis functions reveals that they
are optimal in approximating a linear system given certain knowledge of pole locations. Also,
nonlinear systems can be approximated by linear systems in a local sense. Hence, it is natural to
explore orthonormal basis functions for nonlinear system identification with time-varying or
system-dependent parameters. Another motivation for using orthonormal basis functions is that
the model maintains the linear-in-parameter property. This makes the analysis and synthesis of
the model tractable, such that the uncertainty bounds for robust control can be estimated from the
bias and variance information of the estimated parameters.

In this section, an innovative algorithm is proposed that utilizes multiple Laguerre bases
for nonlinear system identification. The issues resolved are the selection of pole locations for
each Laguerre basis and parameter estimation. For simplicity, the algorithm is limited to
single-input single-output systems. The initial thought was to select the Laguerre poles so that
the local models spanned by the individual Laguerre bases become orthonormal to each other. In
this way, the locality of local models does not depend on a distance metric in
regression space or on prediction errors. For example, consider the following system:

$$ y(t) = f(\varphi(t)) + v(t) $$

where $y(t)$, $v(t)$ and $\varphi(t)$ are the output, measurement noise and regression vector, respectively.
The model by multiple Laguerre bases can be written as follows:

$$ \begin{aligned} y(t) &= \left(\Lambda^1(q)\theta^1 + \Lambda^2(q)\theta^2 + \cdots + \Lambda^n(q)\theta^n\right)u(t) \\ &= \varphi^1(t)\theta^1 + \varphi^2(t)\theta^2 + \cdots + \varphi^n(t)\theta^n \\ &= y^1(t) + y^2(t) + \cdots + y^n(t) \end{aligned} $$

where $\Lambda^j$ is a local model spanned by the $j$th Laguerre basis and $\varphi^j$ is a local regression vector.
Each local model is expanded as $y^j(t) = \sum_i \Lambda_i^j(q)\theta_i^j u(t)$.

Then each local model can be identified simply by refitting residuals by least squares
estimation until the desired accuracy is achieved. This process holds advantages over the
existing multiple model approaches: (1) a constructive modeling process to decide the number of
local models and local model structures, and (2) natural estimation of global uncertainty bounds by

$$ y(t) = \left[\Lambda^1(q)(\theta^1 + \Delta^1) + \Lambda^2(q)(\theta^2 + \Delta^2) + \cdots + \Lambda^m(q)(\theta^m + \Delta^m) + \Delta(q)\right]u(t) $$

where $\Delta^i$ is a real vector from the covariance of parameter estimation and $\Delta(q)$ is a complex
function from undermodeling. However, it soon turned out that selecting Laguerre poles to be
orthonormal to each other is not a trivial process without knowledge of the system poles.
Therefore, a different, somewhat heuristic, approach is used in this study.
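
The idea of identifying local models by refitting residuals can be sketched as follows. This is an illustrative example, not the algorithm of this study: the helper names (`laguerre_regressors`, `solve`, `fit_local_model`, `sse`), the synthetic data, and the candidate poles 0.7 and 0.4 are assumptions, and the poles are simply fixed by hand rather than estimated.

```python
import math


def laguerre_regressors(u, a, n):
    """Columns x_1..x_n: u filtered by n discrete Laguerre functions with pole a."""
    steps = len(u)
    gain = math.sqrt(1.0 - a * a)
    cols = [[0.0] * steps for _ in range(n)]
    for t in range(1, steps):
        cols[0][t] = a * cols[0][t - 1] + gain * u[t - 1]
    for i in range(1, n):
        for t in range(1, steps):
            cols[i][t] = a * cols[i][t - 1] + cols[i - 1][t - 1] - a * cols[i - 1][t]
    return cols


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_local_model(cols, target):
    """Least squares fit theta = (X'X)^{-1} X' target; returns theta and residual."""
    n = len(cols)
    XtX = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(n)] for i in range(n)]
    XtY = [sum(p * q for p, q in zip(cols[i], target)) for i in range(n)]
    theta = solve(XtX, XtY)
    fitted = [sum(theta[i] * cols[i][t] for i in range(n)) for t in range(len(target))]
    return theta, [y - f for y, f in zip(target, fitted)]


def sse(v):
    """Sum of squared entries of a sequence."""
    return sum(e * e for e in v)


# Synthetic data: the sum of two first-order Laguerre responses (poles 0.8 and 0.3).
u = [math.sin(0.05 * t) + math.sin(0.9 * t) for t in range(400)]
y = [1.5 * p + 0.7 * q for p, q in zip(laguerre_regressors(u, 0.8, 1)[0],
                                        laguerre_regressors(u, 0.3, 1)[0])]

# Stage 1 fits a local model with one assumed pole; stage 2 refits its residual.
theta1, r1 = fit_local_model(laguerre_regressors(u, 0.7, 2), y)
theta2, r2 = fit_local_model(laguerre_regressors(u, 0.4, 2), r1)
```

Each stage can only reduce the sum of squared residuals, so stages can be added until the desired accuracy is reached; how well each stage does depends on how its pole is chosen, which is the question the rest of this section addresses.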

Since  we  do  not  use  any  assumptions  about  the  system,  the  poles  must  be  estimated  from  the 
data.  For  the  optimal  pole  locations,  a  numerical  search  based  on  the  Newton-Raphson  method  is 
adopted.  The  derivatives  can  be  computed  in  closed  forms  because  of  the  nice  property  of 
Laguerre  function.  The  derivation  is  similar  to  the  continuous  time  case  in  [Malti  et  al.,  1998], 
but  is  extended  into  the  discrete-time  Laguerre  basis  here.  Consider  a  local  Laguerre  basis 
function  expansion: 


y^t)  =  =^AJ,(9)e/u(() 

1=1 


(6) 
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where  A'  (q)  - 


jl-(aJ)2 


q-aJ 


\  —  aJq 
q  —  aj 


=  and  \aJ  <i.  aJ  is  the  pole  of  the yth  local 


Laguerre  basis  set.  Define  a  cost  function  to  be  the  following: 


(7) 

where  e  =  Y-X6\  0/  -  0'],  Y  =  \y(  1)  y(X,  •••  >«')]' ,  and 

X  =  [n(l)  u(2)  -  A\(q)  -  A'„(9)]  =  E/[a'(?)  A'(9)  -  A'.(?)]. 


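The Newton-Raphson method to estimate the local Laguerre pole is given as:

$$a^j_{n+1} = a^j_n - \mu\,\frac{\partial J / \partial a^j}{\partial^2 J / \partial (a^j)^2} \qquad (8)$$

where $\partial J/\partial a^j$ and $\partial^2 J/\partial (a^j)^2$ are evaluated at $a^j_n$, and $\mu$ is the step size. The derivative of $J$ with respect to $a^j$ is the following:

$$\frac{\partial J}{\partial a^j} = e'\,\frac{\partial e}{\partial a^j} = -e'\,\frac{\partial X}{\partial a^j}\,\theta^j - e'X\,\frac{\partial \theta^j}{\partial a^j} \qquad (9)$$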


The parameter vector $\theta^j$ can be written in a closed-form solution, the so-called normal equation:

$$\theta^j = (X'X)^{-1}X'Y. \qquad (10)$$

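The partial derivative of the normal equation can be written as:

$$\frac{\partial \theta^j}{\partial a^j} = (X'X)^{-1}\left[\frac{\partial X'}{\partial a^j}\,Y - \left(\frac{\partial X'}{\partial a^j}\,X + X'\,\frac{\partial X}{\partial a^j}\right)\theta^j\right]$$

The second term of (9), $e'X\,\partial\theta^j/\partial a^j$, becomes zero since

$$\theta^j = (X'X)^{-1}X'Y = (X'X)^{-1}X'(X\theta^j + e) = \theta^j + (X'X)^{-1}X'e. \qquad (11)$$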


Hence, $(X'X)^{-1}X'e = 0$ and from (11), (9) becomes the following:

$$\frac{\partial J}{\partial a^j} = -e'\,\frac{\partial X}{\partial a^j}\,\theta^j \qquad (12)$$
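The second derivative of $J$ with respect to $a^j$ is:

$$\frac{\partial^2 J}{\partial (a^j)^2} = \left(\frac{\partial X}{\partial a^j}\,\theta^j + X\,\frac{\partial \theta^j}{\partial a^j}\right)'\frac{\partial X}{\partial a^j}\,\theta^j - e'\left(\frac{\partial^2 X}{\partial (a^j)^2}\,\theta^j + \frac{\partial X}{\partial a^j}\,\frac{\partial \theta^j}{\partial a^j}\right) \qquad (13)$$

So, the remaining equations to be derived are the first and second partial derivatives of $X$. The derivation is done by the following Laguerre filter property:

$$\frac{\partial \Lambda_i^j(q)}{\partial a^j} = \frac{i\,\Lambda_{i+1}^j(q) - (i-1)\,\Lambda_{i-1}^j(q)}{1 - (a^j)^2}, \quad i = 1, \ldots, m. \qquad (14)$$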


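Therefore, the first partial derivative of $X$ is:

$$\frac{\partial X(:,i)}{\partial a^j} = \frac{U\left[i\,\Lambda_{i+1}^j(q) - (i-1)\,\Lambda_{i-1}^j(q)\right]}{1 - (a^j)^2} \qquad (15)$$

where $U$ is the input data vector from the definition of $X$ and $(:,i)$ means the $i$th column of a matrix. The second partial derivative of $X$ is the following:

$$\frac{\partial^2 X(:,i)}{\partial (a^j)^2} = \frac{U\left[i(i+1)\,\Lambda_{i+2}^j(q) + 2a^j i\,\Lambda_{i+1}^j(q) - (2i^2 - 2i + 1)\,\Lambda_i^j(q) - 2a^j(i-1)\,\Lambda_{i-1}^j(q) + (i-1)(i-2)\,\Lambda_{i-2}^j(q)\right]}{\left(1 - (a^j)^2\right)^2} \qquad (16)$$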

The algorithm for estimating the optimal pole involves estimating parameters by (10) and then
advancing the pole estimate using (8), (12), (13), (11), (15), and (16). This process can be
iterated until an optimal solution is obtained. As is well known, optimization can get
stuck in a local minimum; therefore, the optimization must be repeated with different initial
conditions. This algorithm is implemented as a Matlab function in this study. Using the
optimization algorithm derived above, Laguerre poles can be located. As mentioned before,
selecting Laguerre poles orthonormal to each other is not a trivial problem. So, the
training data is divided into multiple local data sets, and then the optimization
algorithm derived above is applied to each. If the poles estimated from different data sets are different
enough, these poles are considered to be the poles for local Laguerre bases. This procedure is
somewhat heuristic; however, the optimization is very efficient since it is
performed only in a single dimension with a small data set. A moving window along the time
horizon can also be used instead of segmenting the data set.
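The pole search described above can be sketched in NumPy as follows. This is a minimal illustration, not the Matlab function used in the study: it assumes the cost J = 0.5 e'e, zero initial filter states, and adds two safeguards that are not part of the text (step halving until the cost decreases, and clipping the pole to the interval (-0.99, 0.99)). All function names are hypothetical.

```python
import numpy as np

def laguerre_regressors(u, a, m):
    """Filter u through the first m discrete Laguerre functions with pole a."""
    u = np.asarray(u, dtype=float)
    X = np.zeros((len(u), m))
    gain = np.sqrt(1.0 - a * a)
    for t in range(1, len(u)):
        X[t, 0] = a * X[t - 1, 0] + gain * u[t - 1]
        for i in range(1, m):
            X[t, i] = a * X[t - 1, i] + X[t - 1, i - 1] - a * X[t, i - 1]
    return X

def cost_and_derivatives(u, y, a, m):
    """Return J, dJ/da and d2J/da2 at pole a, assuming J = 0.5 * e'e."""
    y = np.asarray(y, dtype=float)
    L = laguerre_regressors(u, a, m + 2)   # two extra columns feed the derivative formulas
    zero = np.zeros(len(y))
    col = lambda k: L[:, k] if k >= 0 else zero
    X, s = L[:, :m], 1.0 - a * a
    dX, d2X = np.zeros_like(X), np.zeros_like(X)
    for c in range(m):
        i = c + 1                           # 1-based Laguerre index
        dX[:, c] = (i * col(c + 1) - (i - 1) * col(c - 1)) / s
        d2X[:, c] = (i * (i + 1) * col(c + 2) + 2 * a * i * col(c + 1)
                     - (2 * i * i - 2 * i + 1) * col(c)
                     - 2 * a * (i - 1) * col(c - 1)
                     + (i - 1) * (i - 2) * col(c - 2)) / s ** 2
    G = X.T @ X
    theta = np.linalg.solve(G, X.T @ y)     # normal equation
    dtheta = np.linalg.solve(G, dX.T @ y - (dX.T @ X + X.T @ dX) @ theta)
    e = y - X @ theta
    v = dX @ theta
    J = 0.5 * float(e @ e)
    dJ = -float(e @ v)                      # gradient after X'e = 0 is used
    d2J = float((v + X @ dtheta) @ v - e @ (d2X @ theta + dX @ dtheta))
    return J, dJ, d2J

def estimate_pole(u, y, a0, m, mu=1.0, iters=30, tol=1e-10):
    """Newton-Raphson pole update with step halving and clipping as added safeguards."""
    a = float(a0)
    J, dJ, d2J = cost_and_derivatives(u, y, a, m)
    for _ in range(iters):
        if abs(dJ) < tol or d2J == 0.0:
            break
        step = mu * dJ / abs(d2J)           # abs() keeps the step a descent direction
        while abs(step) > 1e-12:
            a_new = float(np.clip(a - step, -0.99, 0.99))
            J_new, dJ_new, d2J_new = cost_and_derivatives(u, y, a_new, m)
            if J_new < J:
                break
            step *= 0.5                     # halve until the cost decreases
        else:
            break                           # no improving step found
        a, J, dJ, d2J = a_new, J_new, dJ_new, d2J_new
    return a, J

# Usage: noise-free data generated with pole 0.6, search started at 0.4.
rng = np.random.default_rng(1)
u = rng.standard_normal(400)
y = laguerre_regressors(u, 0.6, 3) @ np.array([1.0, 0.5, -0.3])
J0 = cost_and_derivatives(u, y, 0.4, 3)[0]
a_hat, J_hat = estimate_pole(u, y, a0=0.4, m=3)
```

As the text notes, a single run may stop in a local minimum, so in practice the search would be restarted from several values of `a0` and the result with the lowest cost kept.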

2.3  Parameter  estimation  by  recursive  least  square  with  forgetting 

Since the multiple models are linearly parameterized, the estimation of parameters is very
efficient. In this study, recursive least squares with a forgetting factor is adopted [Anderson,
1985]. This is based on the assumption that a nonlinear system can be approximated by a linear
system with time-varying parameters. Therefore, the parameters must be adapted with time or
states. The well-known recursive least squares with forgetting factor is given as:

$$\theta_{k+1} = \theta_k + P_k X(k+1)\,\alpha_k\left(y(k+1) - X(k+1)'\theta_k\right) \qquad (17)$$

where $\alpha_k = \dfrac{1}{\gamma + X(k+1)' P_k X(k+1)}$, $\gamma$ is a forgetting factor, $X(k+1)$ is the regression vector at sequence $k+1$, $y(k+1)$ is the output at sequence $k+1$, and $\theta_k$ is the estimated parameter at sequence $k$. $P_k$ is initially a large number and is recursively updated by $P_{k+1} = \dfrac{1}{\gamma}\left[P_k - \alpha_k P_k X(k+1) X(k+1)' P_k\right]$.
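A minimal NumPy sketch of recursive least squares with a forgetting factor is given below, written in the standard textbook form. The function name, the default forgetting factor of 0.98 and the initial covariance scale `p0` are illustrative assumptions rather than values reported in the study.

```python
import numpy as np

def rls_forgetting(X, y, gamma=0.98, p0=1e4):
    """Run RLS with forgetting factor gamma over the rows of X.

    Returns the final parameter estimate and the estimate after every sample.
    """
    n, m = X.shape
    theta = np.zeros(m)
    P = p0 * np.eye(m)                      # "initially large" covariance
    history = np.zeros((n, m))
    for k in range(n):
        x = X[k]
        alpha = 1.0 / (gamma + x @ P @ x)
        theta = theta + P @ x * alpha * (y[k] - x @ theta)
        P = (P - alpha * np.outer(P @ x, x @ P)) / gamma
        history[k] = theta
    return theta, history

# Usage: the true parameters switch halfway through the record;
# the forgetting factor lets the estimate follow the change.
rng = np.random.default_rng(2)
X = rng.standard_normal((400, 2))
theta_a, theta_b = np.array([2.0, -1.0]), np.array([0.5, 1.5])
y = np.concatenate([X[:200] @ theta_a, X[200:] @ theta_b])
theta_hat, history = rls_forgetting(X, y, gamma=0.95)
```

A smaller forgetting factor discounts old samples faster, which improves tracking of parameter changes at the price of noisier estimates.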


2.4  The  proposed  identification  algorithm 

The  proposed  algorithm  consists  of  two  stages:  (1)  off-line  local  Laguerre  pole  estimation  by 
the  algorithm  derived  in  Subsection  2.2,  and  (2)  on-line  recursive  parameter  estimation  shown  in 
Subsection  2.3.  The  following  procedure  shows  the  identification  processes  involved: 

1. segment the data set

2. find a Laguerre pole by using the optimization algorithm using (10), (8), (12), (13), (11), (15), and (16)

3. if the identified pole is different enough, accept it as an additional system pole

4. repeat the above process with the next segmented data set

5. after identifying the poles, apply the parameter estimation algorithm on-line using (17).
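Steps 1 to 4 of the procedure above can be sketched as follows. To keep the example short, a coarse grid search over candidate poles stands in for the Newton-Raphson search; this substitution, the threshold `min_gap` used for the "different enough" test, and all function names are assumptions made for illustration. Step 5 (on-line estimation) is not covered here.

```python
import numpy as np

def laguerre_regressors(u, a, m):
    """Filter u through the first m discrete Laguerre functions with pole a."""
    u = np.asarray(u, dtype=float)
    X = np.zeros((len(u), m))
    gain = np.sqrt(1.0 - a * a)
    for t in range(1, len(u)):
        X[t, 0] = a * X[t - 1, 0] + gain * u[t - 1]
        for i in range(1, m):
            X[t, i] = a * X[t - 1, i] + X[t - 1, i - 1] - a * X[t, i - 1]
    return X

def segment_cost(u, y, a, m):
    """Sum of squared residuals of the least-squares fit for pole a."""
    X = laguerre_regressors(u, a, m)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ theta
    return float(e @ e)

def identify_local_poles(u, y, seg_len, m, min_gap=0.1, grid=None):
    """Segment the data, estimate one pole per segment, keep the distinct ones."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)  # grid search replaces Newton-Raphson here
    poles = []
    for start in range(0, len(u) - seg_len + 1, seg_len):
        us, ys = u[start:start + seg_len], y[start:start + seg_len]
        costs = [segment_cost(us, ys, a, m) for a in grid]
        a_best = float(grid[int(np.argmin(costs))])
        if all(abs(a_best - p) > min_gap for p in poles):  # "different enough"
            poles.append(a_best)
    return poles

# Usage: four segments, the first two generated with pole 0.3, the last two with 0.8.
rng = np.random.default_rng(3)
theta = np.array([1.0, 0.5, -0.3])
u = rng.standard_normal(600)
y = np.concatenate([laguerre_regressors(u[k:k + 150], a, 3) @ theta
                    for k, a in zip(range(0, 600, 150), [0.3, 0.3, 0.8, 0.8])])
poles = identify_local_poles(u, y, seg_len=150, m=3)
```

Poles found in later segments that fall within `min_gap` of an accepted pole are discarded, so the number of local models grows only when a segment shows genuinely different dynamics.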


3.  Simulation  Study 

This section is intended to verify the characteristics of the proposed algorithm for system
identification by a simulation study. Two discrete-time models are assumed to be the true systems
that generated the data: (i) a SISO linear system, (ii) a SISO nonlinear system. They are taken from
a published paper [Stenman, 1999]. For more realistic simulations, outputs are corrupted with
Gaussian random noise. The selected systems are listed in the following.

(i) SISO linear system:

$$y(t) - 1.5\,y(t-1) + 0.7\,y(t-2) = u(t-1) + 0.5\,u(t-2) + e(t); \qquad (18)$$

(ii) SISO nonlinear system:

$$x_1(t+1) = \left(\frac{x_1(t)}{1 + x_1^2(t)} + 1\right)\sin(x_2(t))$$

$$x_2(t+1) = x_2(t)\cos(x_2(t)) + x_1(t)\exp\left(-\frac{x_1^2(t) + x_2^2(t)}{8}\right) + \frac{u^3(t)}{1 + u^2(t) + 0.5\cos(x_1(t) + x_2(t))} \qquad (19)$$

$$y(t) = \frac{x_1(t)}{1 + 0.5\sin(x_2(t))} + \frac{x_2(t)}{1 + 0.5\sin(x_1(t))} + e(t).$$
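For readers who want to reproduce the linear test case, the sketch below simulates system (i) and fits a linear ARX model by ordinary least squares. The function names, the random seed, the record length of 5000 samples and the zero initial conditions are assumptions for illustration only.

```python
import numpy as np

def simulate_system_i(u, e):
    """y(t) - 1.5 y(t-1) + 0.7 y(t-2) = u(t-1) + 0.5 u(t-2) + e(t), zero initial state."""
    y = np.zeros(len(u))
    for t in range(2, len(u)):
        y[t] = 1.5 * y[t - 1] - 0.7 * y[t - 2] + u[t - 1] + 0.5 * u[t - 2] + e[t]
    return y

def fit_arx(u, y, na=2, nb=2, nk=1):
    """Least-squares ARX fit; returns [a_1..a_na, b_1..b_nb] for
    y(t) = sum a_i y(t-i) + sum b_i u(t-nk-i+1)."""
    start = max(na, nb + nk - 1)
    rows = [np.concatenate([y[t - na:t][::-1],
                            u[t - nk - nb + 1:t - nk + 1][::-1]])
            for t in range(start, len(y))]
    theta, *_ = np.linalg.lstsq(np.array(rows), y[start:], rcond=None)
    return theta

# Usage: unit-variance Gaussian input and noise; true values are [1.5, -0.7, 1.0, 0.5].
rng = np.random.default_rng(4)
u = rng.standard_normal(5000)
e = rng.standard_normal(5000)
y = simulate_system_i(u, e)
theta_hat = fit_arx(u, y)
```

Because the noise enters as a white equation error, the least-squares estimate is consistent for this model structure.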
The models considered for system identification are the following four: (1) linear
AutoRegression with eXternal inputs (ARX) model, (2) Feedforward neural networks (FFNN)
based ARX model [Narendra et al., 1995], (3) Self-Organizing Map (SOM) oriented multiple
model [Principe et al., 1998], and (4) the proposed multiple model with multiple Laguerre bases.

3.1  Identification  of  a  linear  discrete-time  system 

In order to generate data, the system (i) was simulated with u(t) and e(t) selected as
independent Gaussian-distributed sequences with zero means and unit variances. The
preliminary regression order selection algorithm based on Lipschitz quotients is applied to the
data [He and Asada, 1993]. The algorithm is written as a function in Matlab. The function
returns the index vector to be used for order estimation over the specified range of orders. This
function requires extensive computation. Hence, an interactive function is also written to
compute Lipschitz quotients only at user-specified orders.
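The idea behind the order index can be sketched as below. This follows the commonly cited form of the Lipschitz-quotient index (geometric mean of the p largest quotients, scaled by the square root of the regressor dimension); the fraction `p_frac`, the lag convention u(t-nk), ..., u(t-nk-nu+1) and the function names are assumptions, and the code is not the Matlab function referred to in the text.

```python
import numpy as np

def build_regressors(u, y, nk, nu, ny):
    """Rows [y(t-1..t-ny), u(t-nk..t-nk-nu+1)] and the matching targets y(t)."""
    start = max(ny, nk + nu - 1)
    rows = [np.concatenate([y[t - ny:t][::-1],
                            u[t - nk - nu + 1:t - nk + 1][::-1]])
            for t in range(start, len(y))]
    return np.array(rows), y[start:]

def lipschitz_index(u, y, nk, nu, ny, p_frac=0.02):
    """Order index for the lag structure [nk nu ny]; all pairs are compared, O(N^2)."""
    Z, target = build_regressors(u, y, nk, nu, ny)
    n_pts, n = Z.shape
    dy = np.abs(target[:, None] - target[None, :])
    dz = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    i, j = np.triu_indices(n_pts, k=1)
    q = dy[i, j] / np.maximum(dz[i, j], 1e-12)   # Lipschitz quotients
    p = max(1, int(p_frac * n_pts))
    largest = np.sort(q)[-p:]
    return float(np.exp(np.mean(np.log(np.sqrt(n) * largest))))

# Usage: noise-free data from system (i); compare an under-specified order with the true one.
rng = np.random.default_rng(5)
u = rng.standard_normal(300)
y = np.zeros(300)
for t in range(2, 300):
    y[t] = 1.5 * y[t - 1] - 0.7 * y[t - 2] + u[t - 1] + 0.5 * u[t - 2]
index_under = lipschitz_index(u, y, nk=1, nu=1, ny=1)
index_true = lipschitz_index(u, y, nk=1, nu=2, ny=2)
```

The index stays large while needed regressors are missing and levels off once the lag structure is sufficient. The pairwise comparison over all samples is what makes the computation expensive.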

The results can be seen in Figure 2. The numbers in the brackets denote the input delay, the order of
the input regression vector, and the order of the output regression vector; in other words, $[n_k \; n_u \; n_y]$ for
a model $y(t) = f(y(t-1), \ldots, y(t-n_y), u(t-n_k), \ldots, u(t-n_u))$. From Figure 2, we can see that
the curve converges after [1 2 2]. Hence, it will be a reasonable estimate of the regression order.

[Figure 2 plot: Preliminary Order Estimation]
It is well known that pre-processing training data can help the identification process.
However, it is noticed that crude normalization can actually damage the quality of the model. In
Figure 3, the adverse effect of normalization of data is shown. Two data sets, one normalized by
$(x - x_{\max})/(x_{\max} - x_{\min})$ and the other not normalized, are used to identify the system with linear
ARX models. The correlation of residuals shows that the estimation is biased because of the
normalization. Also, the adverse effect of normalization seems significant for non-white noise
cases. As a result, no normalization was applied for ARX models.


Figure 3: The effect of normalization of data (panel (b): ARX system). Plots: correlation function of residuals, output #1; cross-correlation function between input and residuals. Horizontal axis: lag.


For  Self-Organizing  Map  (SOM)  based  multiple  modeling,  normalization  is  essential  because 
SOM  depends  on  the  geometric  property  of  data  points.  Without  normalization,  SOM  does  not 
cover  the  regression  space  well.  The  designs  of  all  four  models  considered  are  realized  off-line. 
The  normalized  data  is  used  for  the  feedforward  neural  networks  (FFNN)  based  ARX  model  and 
the  SOM  based  multiple  model  while  the  raw  data  set  is  used  for  the  ARX  and  multiple  models 
by  multiple  Laguerre  bases.  Training  of  the  ARX  model  was  very  efficient  while  the  FFNN  and 
SOM models required several rounds of trial and error to find reasonable models. It was surprising that the
FFNN  model  was  no  better  than  other  models  in  spite  of  the  general  belief  that  neural  networks 
are  good  at  approximating  functions  [Narendra  and  Parthasarathy,  1990].  The  design  parameters 
such  as  the  number  of  hidden  neurons  in  neural  networks,  the  number  of  neurons  in  the  SOM 
model  and  the  number  of  Laguerre  poles  in  the  proposed  method  are  all  selected  ad  hoc  since 
there  are  no  existing  design  guidelines.  The  simulated  response  of  each  model  with  the  training 
data  set  is  shown  in  Figure  4  and  with  the  validation  data  set  in  Figure  5.  Average  RMS  values  of 
simulated errors for 10 runs on each model are listed in Table 1. Notice that the proposed
algorithm produced performance competitive with the other three models.


[Plots of simulated responses for system (i). Panel titles: Feedforward neural network based ARX model (hidden neurons = 10); Self-Organizing Map based multiple model (number of local models = 6); Multiple model by multiple Laguerre bases (number of local models = 3). Horizontal axis: time sequence.]

Table 1: RMS error of each model for system (i)

Model                                       Training RMS error   Generalization RMS error
ARX model                                   2.8030               3.2258
Neural network based ARX model              2.7365               3.3841
Self-Organizing Map based multiple model    2.8726               3.5082
Multiple model by multiple Laguerre bases   2.7861               3.2256

3.2  Identification  of  a  nonlinear  discrete  time  system 

After the reassuring results for system (i), which show that these algorithms can identify the unknown system
to a certain extent, a more complex, nonlinear system is tested. The data is collected with the
input ranging from -2.5 to 2.5 and measurement noise of variance 0.1. The identification
procedure was similar to that for system (i). The preliminary order estimation method by Lipschitz
quotients is applied and the result is shown in Figure 6. The graph shows that [1 3 2] may be a
reasonable choice.


[Figure 6 plot: Preliminary Order Estimation]


Similar to system (i), training of the linear ARX model was very efficient. Training of
the neural network model was not slow thanks to the efficient Levenberg-Marquardt algorithm;
however, the response was no better than that of the linear ARX model and was worse in
generalization. In contrast to the disappointing results of a global technique, multiple model
approaches were efficient in training and the generalization results were excellent. Average RMS
values of simulated errors for 10 runs on each model are listed in Table 2. As shown in Table 2,
the proposed method produced the smallest error in the training and generalization sets. In
Figures 7 and 8, the results from the training set and generalization set are shown.


Figure 7: Training results of system (ii)
(dotted line: output of training data, solid line: simulated output)
[Panel titles: ARX model; Feedforward neural network based ARX model (hidden neurons = 20); Self-Organizing Map based multiple model (number of local models = 12); Multiple model by multiple Laguerre bases (number of local models = 7). Horizontal axis: time sequence.]


Figure 8: Generalization results of system (ii)
(dotted line: output of training data, solid line: simulated output)
[Panel titles: ARX model; Feedforward neural network based ARX model (hidden neurons = 20); Self-Organizing Map based multiple model (number of local models = 12); Multiple model by multiple Laguerre bases (number of local models = 7). Horizontal axis: time sequence.]
Table 2: RMS error of each model for system (ii)

Model                                       Training RMS error   Generalization RMS error
ARX model                                   1.6322               1.6292
Neural network based ARX model              1.6956               1.9823
Self-Organizing Map based multiple model    1.5675               1.7222
Multiple model by multiple Laguerre bases   1.5024               1.5074

4.  Conclusion 

This study was motivated by the vision to realize a practical data-driven control system
comparable to conventional model-based methods. The main theme was to adopt a multiple
model approach instead of a global one. By this adoption, we can relieve the computational
burden significantly as well as maintain mathematical tractability. Also, this approach
enables us to take advantage of existing methods such as linear system identification and
many estimation techniques.

The proposed algorithm takes advantage of the properties of orthonormal basis functions,
which resulted in a simple derivation of Laguerre pole estimation. By this, we can relieve the
difficulty of estimating the order of regression vectors and also achieve efficient training while
maintaining mathematical tractability. The simulation results demonstrate the characteristics of
the algorithm. Three other models were also considered: a linear ARX model, a feedforward neural
network based ARX model and SOM based multiple models. Even though the simulation
results do not demonstrate significant improvement over other identification methods, the
proposed identification algorithm is expected to be beneficial when it is used in connection with
robust controller design. This expectation is supported by the great interest in the robust
control community in developing robust control theory for linear time-varying systems. The
proposed identification model should suit robust controller design since it preserves
characteristics needed for robust control such as reasonable model complexity and ease of
estimating model error bounds.